fix(bundler): clean up orphaned KAI and Kubeflow Trainer CRDs on undeploy #416

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:fix/undeploy-orphaned-crds on Mar 16, 2026

@yuanchen8911
Contributor

Summary

Fix undeploy.sh to clean up orphaned CRDs from KAI scheduler and Kubeflow Trainer that survive Helm uninstall.

Motivation / Context

After undeploy.sh completes, 6 orphaned CRDs remain on the cluster:

  • configs.kai.scheduler, schedulingshards.kai.scheduler, topologies.kai.scheduler — created by the KAI operator at runtime, not by the Helm chart
  • trainjobs.trainer.kubeflow.org, trainingruntimes.trainer.kubeflow.org, clustertrainingruntimes.trainer.kubeflow.org — installed by the validator during conformance/performance tests

Neither set carries Helm labels, so the existing delete_release_cluster_resources label-based sweep misses them.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

  • Added explicit CRD cleanup in undeploy.sh.tmpl after namespace deletion, targeting kai.scheduler and trainer.kubeflow.org API groups
  • Added orphaned CRD detection in post-flight verification
  • Approach uses API group matching (not hardcoded CRD names) so new CRDs in these groups are automatically covered
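
A minimal sketch of group-based selection, assuming CRD names come from `kubectl get crd -o name`. The function names `crds_in_group` and `delete_crds_in_groups` are hypothetical and not taken from undeploy.sh.tmpl, and the sketch uses a shell `case` suffix match, which may differ from how the template actually matches names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): read CRD names on stdin and print
# those whose name ends in ".<group>", i.e. CRDs that belong to that API group.
crds_in_group() {
  local group="$1" name
  while IFS= read -r name; do
    # `kubectl get crd -o name` prefixes each name with the resource type.
    name="${name#customresourcedefinition.apiextensions.k8s.io/}"
    case "$name" in
      *".${group}") printf '%s\n' "$name" ;;
    esac
  done
}

# Hypothetical wrapper: delete every CRD in the given API groups.
delete_crds_in_groups() {
  local group crd
  for group in "$@"; do
    kubectl get crd -o name | crds_in_group "$group" | while IFS= read -r crd; do
      # </dev/null keeps kubectl from touching the loop's stdin.
      kubectl delete crd "$crd" --ignore-not-found </dev/null
    done
  done
}
```

Calling `delete_crds_in_groups kai.scheduler trainer.kubeflow.org` would cover the two groups named above. Because selection is by group suffix, a CRD added to either group later would be picked up without changing the call.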

Testing

make test  # All bundler tests pass

Verified generated undeploy.sh includes the new cleanup steps.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Existing bundles won't have the fix — only newly generated bundles. Users with stale CRDs can manually delete them with kubectl delete crd <name>.
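
For bundles generated before this fix, the manual cleanup could look roughly like the sketch below. It loops over the six CRD names listed in the Motivation section; the variable `ORPHANED_CRDS` and the function `delete_stale_crds` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# The six CRD names listed in this PR's description.
ORPHANED_CRDS="
configs.kai.scheduler
schedulingshards.kai.scheduler
topologies.kai.scheduler
trainjobs.trainer.kubeflow.org
trainingruntimes.trainer.kubeflow.org
clustertrainingruntimes.trainer.kubeflow.org
"

# Hypothetical helper: delete each listed CRD that still exists on the cluster.
delete_stale_crds() {
  local crd
  # Unquoted on purpose: word splitting yields one CRD name per iteration.
  for crd in $ORPHANED_CRDS; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "deleting stale CRD: $crd"
      kubectl delete crd "$crd"
    else
      echo "not present: $crd"
    fi
  done
}
```

Run `delete_stale_crds` against the intended cluster context. Deleting a CRD also deletes every custom resource of that kind, so this is only appropriate once the components are no longer in use.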

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 16, 2026 20:34
@yuanchen8911 yuanchen8911 added the bug and area/recipes labels Mar 16, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-orphaned-crds branch from 701bf14 to 5a48c98 Compare March 16, 2026 20:36
Member

@mchmarny mchmarny left a comment

Good fix — the orphaned CRD problem is real and the API group matching approach is the right design. Two inline comments: one on CRD deletion ordering relative to namespace teardown, one on grep precision. Clean change overall.

Member

@mchmarny mchmarny left a comment

The CRD deletion ordering is the only real thing

…ploy

KAI scheduler CRDs (configs, schedulingshards, topologies) are created
by the KAI operator at runtime, not by the Helm chart. Kubeflow Trainer
CRDs are installed by the validator during conformance/performance tests.
Neither carries Helm labels, so the existing label-based CRD sweep in
delete_release_cluster_resources misses them.

- Add explicit CRD cleanup before namespace deletion (while controllers
  still run) with --wait=false to avoid finalizer hangs
- Anchor grep patterns with $ to prevent false matches
- Add orphaned CRD detection in undeploy post-flight verification
- Add orphaned CRD warning in deploy pre-flight checks

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
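
The ordering described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the contents of undeploy.sh.tmpl: `list_group_crds` and `undeploy_sketch` are hypothetical names, and the namespaces are whatever the caller passes in.

```shell
#!/usr/bin/env bash
# API groups named in the PR.
CRD_GROUPS="kai.scheduler trainer.kubeflow.org"

# Hypothetical helper: print names of CRDs that end in ".<group>".
list_group_crds() {
  local group="$1" pattern
  # Escape dots so they match literally, then anchor at end of line.
  pattern="\\.$(printf '%s' "$group" | sed 's/\./\\./g')\$"
  kubectl get crd -o name | sed 's|.*/||' | grep -E "$pattern" || true
}

# Hypothetical teardown order: CRDs first, namespaces second.
undeploy_sketch() {
  local group crd ns
  for group in $CRD_GROUPS; do
    for crd in $(list_group_crds "$group"); do
      # Controllers are still running here, so finalizers can be processed;
      # --wait=false returns immediately instead of blocking on them.
      kubectl delete crd "$crd" --ignore-not-found --wait=false
    done
  done
  # Namespaces go last, once CRD deletion has been requested.
  for ns in "$@"; do
    kubectl delete namespace "$ns" --ignore-not-found
  done
}
```

A call such as `undeploy_sketch <namespace>...` issues the CRD deletions before any namespace deletion. Reversing the two loops would recreate the situation the review flagged, where controllers are gone before their CRDs are removed.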
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-orphaned-crds branch from 5a48c98 to 849a6f0 Compare March 16, 2026 21:57
@yuanchen8911
Contributor Author

Thanks Mark — both addressed:

| # | Issue | Fix |
|---|-------|-----|
| 1 | CRD cleanup runs after namespace deletion: controllers already gone, finalizers can't be processed | Moved CRD cleanup before namespace deletion loop; added --wait=false as safety net |
| 2 | grep "\.${group}" not anchored: could false-match CRDs like foo.xkai.scheduler.io | Anchored with $ in all three locations (undeploy cleanup, undeploy post-flight, deploy pre-flight) |
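
The effect of anchoring can be checked offline with a short sketch. The sample CRD names below are made up for illustration, and the dot-escaping step is an assumption of this sketch rather than something the PR states.

```shell
#!/usr/bin/env bash
# Hypothetical sample names; only the first belongs to the kai.scheduler group.
names="configs.kai.scheduler
foo.kai.scheduler.io"

group="kai.scheduler"
# Escape dots so each one matches a literal "." instead of any character.
escaped="$(printf '%s' "$group" | sed 's/\./\\./g')"

# Unanchored: matches anywhere in the name, so both lines are selected.
unanchored="$(printf '%s\n' "$names" | grep -E "\.${escaped}")"

# Anchored with $: selects only names that end in ".kai.scheduler".
anchored="$(printf '%s\n' "$names" | grep -E "\.${escaped}\$")"

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"
```

With the trailing `$`, a CRD from a different group whose name merely contains the group string is no longer selected for deletion.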

@yuanchen8911 yuanchen8911 merged commit 7c377c1 into NVIDIA:main Mar 16, 2026
13 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
…ploy (NVIDIA#416)

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>
Labels

area/bundler, bug, size/S

2 participants