refactor: add infra-manager CLI for cluster lifecycle #465
Merged
f588f97 to
7020d9c
Compare
gflarity
reviewed
Mar 6, 2026
shayasoolin
reviewed
Mar 8, 2026
shayasoolin
previously approved these changes
Mar 9, 2026
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
When profiling is enabled, infra-manager now passes Grafana/Pyroscope scrape annotations for cpu, memory, and goroutine profiles to the Grove helm chart. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Revert Makefile and e2e script modifications, and remove e2e-cluster requirements.txt added alongside the infra-manager CLI but unrelated to it. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ions Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ter and renaming preset files Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…on prefix Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace split two-level config system (K3dConfig/ComponentConfig/KwokConfig + ClusterSetup/GroveSetup/.config wrappers) with flat per-component models: ClusterConfig, SchedulerConfig/KaiConfig, GroveConfig, KwokConfig, PyroscopeConfig. Add Helm-style -f/--set overlay support to load_setup_config. Update e2e.yaml and scale.yaml to new flat structure. Env var paths change: E2E_CLUSTER__CONFIG__WORKER_NODES -> E2E_CLUSTER__WORKER_NODES, E2E_KAI_VERSION -> E2E_SCHEDULER__KAI__VERSION, etc. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
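The overlay behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual `load_setup_config` implementation: the helper names (`deep_merge`, `nest`, `set_to_overlay`, `env_to_overlay`) are hypothetical, the `E2E_` prefix and `__` nesting delimiter are inferred from the env var names above, and values are kept as plain strings without the type coercion a real config model would likely apply.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with overlay merged into base; base is not mutated."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge recursively instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def nest(path: list[str], value: Any) -> dict[str, Any]:
    """Wrap value in nested dicts, one level per path segment."""
    if not path or any(not part for part in path):
        raise ValueError(f"invalid key path: {path!r}")
    nested: Any = value
    for part in reversed(path):
        nested = {part: nested}
    return nested


def set_to_overlay(expr: str) -> dict[str, Any]:
    """Turn a Helm-style 'a.b.c=value' expression into a nested overlay dict."""
    path, sep, value = expr.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {expr!r}")
    return nest(path.split("."), value)


def env_to_overlay(name: str, value: str, prefix: str = "E2E_") -> dict[str, Any]:
    """Map e.g. E2E_CLUSTER__WORKER_NODES to {'cluster': {'worker_nodes': value}}."""
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    return nest(name[len(prefix):].lower().split("__"), value)
```

A caller would fold overlays over the base config with `deep_merge` in whatever precedence order the loader defines; that order is not stated here, so the sketch leaves it to the caller.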
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…y support Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- grove.py: use truthy check for cluster_cfg.registry (handles empty string)
- cluster.py: add docstring to _attempt inner function
- utils.py: improve collect_grove_helm_overrides docstring to mention annotation overrides
- config.py: fix type annotation result: str|dict, add dotenv exclusion comment
- tests: add base mutation assertion, use distinct --set value in env var priority test
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…I options
- Introduce new `qps` and `burst` fields in GroveConfig for Kubernetes client rate limiting.
- Add corresponding helm override keys: `config.runtimeClientConnection.qps` and `config.runtimeClientConnection.burst`.
- Implement unit tests for `collect_grove_helm_overrides` and GroveConfig field parsing.
- Replace deprecated CLI flags with `--set` overrides for greater flexibility.
- Update documentation, test cases, and preset file paths for consistency.
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…d update skaffold env handling
- Replace `DEPENDENCIES` constant with `dep_value` function for improved flexibility in version handling.
- Update Grove deployment to pass `CONTAINER_REGISTRY` via `_env` instead of modifying `os.environ`.
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
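The idea behind the `CONTAINER_REGISTRY` change can be illustrated with a small sketch. It uses the standard library's `subprocess.run(..., env=...)` as a stand-in for the `_env` argument mentioned above, and `run_with_env` is a hypothetical helper, not a function from this PR. The point is that the variable is scoped to the child process rather than leaking into the parent's environment.

```python
import os
import subprocess
import sys
from typing import Mapping, Sequence


def run_with_env(cmd: Sequence[str], extra_env: Mapping[str, str]) -> str:
    """Run cmd with extra variables visible only to the child process."""
    # Copy the current environment and layer the extras on top;
    # os.environ itself is never modified.
    child_env = {**os.environ, **extra_env}
    result = subprocess.run(
        list(cmd),
        env=child_env,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # The registry address is a placeholder value for illustration.
    out = run_with_env(
        [sys.executable, "-c", "import os; print(os.environ['CONTAINER_REGISTRY'])"],
        {"CONTAINER_REGISTRY": "registry.example:5000"},
    )
    print(out.strip())
```

Compared with assigning to `os.environ`, this keeps later subprocesses and tests in the same interpreter unaffected by a previous deployment step.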
…tory structure Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Encode helm flag alongside each value at the source so consumers don't need to know which values require string coercion. Annotation overrides now carry --set-string, preventing helm from coercing "true"/port numbers to booleans/integers that Kubernetes rejects. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
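A rough sketch of what carrying the helm flag with each value might look like is shown below. The `HelmOverride` type and `to_helm_args` function are hypothetical names for illustration, and the annotation key is a made-up placeholder; only the `--set` and `--set-string` flags and the `config.runtimeClientConnection.qps` key come from helm and the commits above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HelmOverride:
    """One helm override plus the flag that should carry it."""

    flag: str  # "--set" or "--set-string"
    key: str
    value: str


def to_helm_args(overrides: Iterable[HelmOverride]) -> List[str]:
    """Flatten overrides into helm CLI arguments, preserving order."""
    args: List[str] = []
    for override in overrides:
        # The consumer just forwards the flag chosen at the source; it does
        # not need to know which values must stay strings.
        args.extend([override.flag, f"{override.key}={override.value}"])
    return args


if __name__ == "__main__":
    overrides = [
        HelmOverride("--set", "config.runtimeClientConnection.qps", "50"),
        # Hypothetical annotation key; --set-string keeps "true" a string.
        HelmOverride("--set-string", "podAnnotations.profiling-enabled", "true"),
    ]
    print(to_helm_args(overrides))
```

Kubernetes annotation values must be strings, which is why an annotation value such as `"true"` or a port number needs `--set-string` while a numeric setting such as `qps` can use plain `--set`.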
513b9fe to
0779f77
Compare
shayasoolin
approved these changes
Mar 11, 2026
enoodle
approved these changes
Mar 12, 2026
gflarity
approved these changes
Mar 12, 2026
enoodle
pushed a commit
to enoodle/grove
that referenced
this pull request
Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a unified `infra-manager.py` CLI (Typer-based) for Grove e2e cluster lifecycle management: cluster creation, component installation (k3d, KWOK, kai-scheduler, Pyroscope, Grove operator), and teardown. Replaces scattered shell/Python scripts with a modular Python package under `operator/hack/infra_manager/`.
Also adds Grafana/Pyroscope scrape annotations to helm overrides when profiling is enabled, so Pyroscope auto-discovers the pprof endpoint. Updates Makefile with new e2e targets (`e2e-cluster-up`, `e2e-cluster-down`, `run-e2e-full`, `scale-cluster-up`, `scale-cluster-down`).
Which issue(s) this PR fixes:
Part of #424
Special notes for your reviewer:
Depends on #464 for the Helm chart annotations support.
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.