refactor: add infra-manager CLI for cluster lifecycle #465

Merged
Ronkahn21 merged 22 commits into ai-dynamo:main from Ronkahn21:refactor/infra-manager
Mar 12, 2026

@Ronkahn21 Ronkahn21 commented Mar 2, 2026

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a unified infra-manager.py CLI (Typer-based) for Grove e2e cluster lifecycle management — cluster creation, component installation (k3d, KWOK, kai-scheduler, Pyroscope, Grove operator), and teardown. Replaces scattered shell/Python scripts with a modular Python package under operator/hack/infra_manager/.

Also adds Grafana/Pyroscope scrape annotations to helm overrides when profiling is enabled, so Pyroscope auto-discovers the pprof endpoint. Updates Makefile with new e2e targets (e2e-cluster-up, e2e-cluster-down, run-e2e-full, scale-cluster-up, scale-cluster-down).

Which issue(s) this PR fixes:

Part of #424

Special notes for your reviewer:

Depends on #464 for the Helm chart annotations support.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.

See operator/hack/README.md for CLI usage.

Comment thread operator/hack/requirements.txt Outdated
Comment thread operator/Makefile Outdated
Comment thread operator/hack/infra_manager/config.py Outdated
Comment thread operator/hack/infra_manager/commands/setup_cmd.py Outdated
shayasoolin previously approved these changes Mar 9, 2026
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
When profiling is enabled, infra-manager now passes
Grafana/Pyroscope scrape annotations for cpu, memory,
and goroutine profiles to the Grove helm chart.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
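
A minimal sketch of how the scrape annotations described in this commit might be assembled. The keys follow the `profiles.grafana.com/<type>.scrape` and `.port` convention used by Grafana Pyroscope's Kubernetes discovery; the function name and the port value are hypothetical and not taken from the PR.

```python
from typing import Dict

# Profile types named in the commit message.
PROFILE_TYPES = ("cpu", "memory", "goroutine")


def pyroscope_annotations(profiling_enabled: bool, pprof_port: int) -> Dict[str, str]:
    """Return pod annotations that mark the pprof endpoint for scraping.

    Hypothetical helper: the real infra-manager code may be structured differently.
    """
    if not profiling_enabled:
        return {}
    result: Dict[str, str] = {}
    for profile in PROFILE_TYPES:
        prefix = f"profiles.grafana.com/{profile}"
        # Kubernetes annotation values must be strings, so both values are quoted.
        result[f"{prefix}.scrape"] = "true"
        result[f"{prefix}.port"] = str(pprof_port)
    return result
```

Returning an empty dict when profiling is off lets callers merge the result into other overrides unconditionally.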
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Revert Makefile and e2e script modifications, and remove
e2e-cluster requirements.txt added alongside the infra-manager
CLI but unrelated to it.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ions

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ter and renaming preset files

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…on prefix

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace split two-level config system (K3dConfig/ComponentConfig/KwokConfig
+ ClusterSetup/GroveSetup/.config wrappers) with flat per-component models:
ClusterConfig, SchedulerConfig/KaiConfig, GroveConfig, KwokConfig, PyroscopeConfig.

Add Helm-style -f/--set overlay support to load_setup_config.
Update e2e.yaml and scale.yaml to new flat structure.
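
The overlay idea can be sketched roughly as follows, assuming `-f` files are parsed into dicts and `--set` takes dotted `key=value` assignments. `deep_merge` and `apply_set` are hypothetical names; the real `load_setup_config` may coerce types and order its sources differently.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with overlay applied on top of base, without mutating base."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_set(config: Dict[str, Any], assignment: str) -> Dict[str, Any]:
    """Apply one 'a.b.c=value' assignment; the value is kept as a string in this sketch."""
    path, sep, raw = assignment.partition("=")
    if not sep or not path:
        raise ValueError(f"expected key=value, got {assignment!r}")
    nested: Any = raw
    # Build {"a": {"b": {"c": raw}}} from the innermost key outwards.
    for part in reversed(path.split(".")):
        nested = {part: nested}
    return deep_merge(config, nested)
```

For example, `apply_set({"cluster": {"worker_nodes": 2}}, "cluster.worker_nodes=30")` yields a new config with the string `"30"` under `cluster.worker_nodes`, leaving the input dict unchanged.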

Env var paths change: E2E_CLUSTER__CONFIG__WORKER_NODES -> E2E_CLUSTER__WORKER_NODES,
E2E_KAI_VERSION -> E2E_SCHEDULER__KAI__VERSION, etc.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
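
The new variable names suggest a prefix (`E2E_`) plus a `__` delimiter for nesting. A stdlib-only sketch of that mapping, under that assumption, might look like this; `env_to_nested` is a hypothetical helper, and the project may rely on a settings library instead.

```python
from typing import Any, Dict, Mapping


def env_to_nested(
    environ: Mapping[str, str], prefix: str = "E2E_", delimiter: str = "__"
) -> Dict[str, Any]:
    """Turn E2E_A__B__C=value into {"a": {"b": {"c": "value"}}}.

    Assumes no variable is both a leaf and a parent of another variable.
    """
    result: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated environment variables
        parts = name[len(prefix):].lower().split(delimiter)
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result
```

With this mapping, `E2E_CLUSTER__WORKER_NODES` lands at `cluster.worker_nodes` and `E2E_SCHEDULER__KAI__VERSION` at `scheduler.kai.version`, matching the flat per-component layout described in the commit.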
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…y support

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- grove.py: use truthy check for cluster_cfg.registry (handles empty string)
- cluster.py: add docstring to _attempt inner function
- utils.py: improve collect_grove_helm_overrides docstring to mention annotation overrides
- config.py: fix type annotation result: str|dict, add dotenv exclusion comment
- tests: add base mutation assertion, use distinct --set value in env var priority test

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…I options

- Introduce new `qps` and `burst` fields in GroveConfig for Kubernetes client rate limiting.
- Add corresponding helm override keys: `config.runtimeClientConnection.qps` and `config.runtimeClientConnection.burst`.
- Implement unit tests for `collect_grove_helm_overrides` and GroveConfig field parsing.
- Replace deprecated CLI flags with `--set` overrides for greater flexibility.
- Update documentation, test cases, and preset file paths for consistency.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
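
A rough sketch of how optional `qps` and `burst` fields could map to the helm override keys named above. `GroveConfigSketch` and `rate_limit_overrides` are hypothetical stand-ins for the real `GroveConfig` and `collect_grove_helm_overrides`, and the field types are an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GroveConfigSketch:
    """Hypothetical stand-in holding only the two rate-limit fields."""

    qps: Optional[int] = None
    burst: Optional[int] = None


def rate_limit_overrides(cfg: GroveConfigSketch) -> Dict[str, str]:
    """Emit helm overrides only for the fields that were actually set."""
    overrides: Dict[str, str] = {}
    if cfg.qps is not None:
        overrides["config.runtimeClientConnection.qps"] = str(cfg.qps)
    if cfg.burst is not None:
        overrides["config.runtimeClientConnection.burst"] = str(cfg.burst)
    return overrides
```

Skipping unset fields leaves the chart's own defaults in effect unless a value is supplied explicitly.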
…d update skaffold env handling

- Replace `DEPENDENCIES` constant with `dep_value` function for improved flexibility in version handling.
- Update Grove deployment to pass `CONTAINER_REGISTRY` via `_env` instead of modifying `os.environ`.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
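
The second bullet describes passing the variable to the child process only. The sketch below illustrates the same idea with the standard library's `subprocess.run(env=...)`, which is swapped in here for the `_env` keyword the commit mentions; `run_with_registry` is a hypothetical name.

```python
import os
import subprocess
import sys
from typing import List


def run_with_registry(cmd: List[str], registry: str) -> subprocess.CompletedProcess:
    """Run cmd with CONTAINER_REGISTRY set for the child only.

    The parent's os.environ is copied, never modified.
    """
    child_env = {**os.environ, "CONTAINER_REGISTRY": registry}
    return subprocess.run(cmd, env=child_env, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    # Demo: a child Python process echoes the variable it received.
    probe = [sys.executable, "-c", "import os; print(os.environ['CONTAINER_REGISTRY'])"]
    print(run_with_registry(probe, "localhost:5001").stdout.strip())
```

Keeping the change local to the child avoids leaking the registry setting into later steps that run in the same process.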
…tory structure

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Encode helm flag alongside each value at the source so consumers
don't need to know which values require string coercion. Annotation
overrides now carry --set-string, preventing helm from coercing
"true"/port numbers to booleans/integers that Kubernetes rejects.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
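
One way to picture "encoding the helm flag alongside each value" is a small record type that carries its own flag. This is a sketch under assumptions: `HelmOverride`, `to_helm_args`, and the `podAnnotations` and `replicaCount` chart keys are hypothetical. Only the `--set` and `--set-string` flags, and the backslash escaping of dots inside a key segment, are standard Helm behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HelmOverride:
    """A helm value together with the flag that should carry it."""

    key: str
    value: str
    flag: str = "--set"  # use "--set-string" to stop helm from coercing the value


def to_helm_args(overrides: Iterable[HelmOverride]) -> List[str]:
    """Flatten overrides into CLI arguments without inspecting the values."""
    args: List[str] = []
    for override in overrides:
        args += [override.flag, f"{override.key}={override.value}"]
    return args


if __name__ == "__main__":
    overrides = [
        HelmOverride("replicaCount", "1"),
        # Dots inside the annotation name are escaped so helm treats it as one key.
        HelmOverride(r"podAnnotations.profiles\.grafana\.com/cpu\.scrape", "true", "--set-string"),
    ]
    print(to_helm_args(overrides))
```

Because the flag travels with the value, the code that builds the helm command line never needs to know which keys are annotations.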
@Ronkahn21 Ronkahn21 force-pushed the refactor/infra-manager branch from 513b9fe to 0779f77 on March 11, 2026 15:50
@Ronkahn21 Ronkahn21 merged commit 3457783 into ai-dynamo:main Mar 12, 2026
11 checks passed
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>

4 participants