-
Notifications
You must be signed in to change notification settings - Fork 22
Closed
Copy link
Description
Prerequisites
- I searched existing issues and found no duplicates
- I can reproduce this issue consistently
- This is not a security vulnerability (use Security Advisories instead)
Bug Description
When running deploy.sh after having generated a bundle from a recipe (specifically CUJ2), I had some issues on deploying:
- first deploy failed on a context deadline exceeded on CRD application
- second deploy attempted to do it idempotently, but ran into an issue with the webhook configurations that is being fixed
- third deploy failed but was prefaced by an undeploy.sh but the undeploy left some cert-manager CRDs around due to some resource policies
- fourth deploy failed prefaced by an undeploy.sh that succeeded
- fifth deploy succeeded
Would be nice if we had better retry mechanisms on this as the recovery is to just undeploy and deploy which is not ideal
Impact
Low (minor issue)
Component
Other / Unknown
Regression?
Yes, this worked before (please specify version below)
Steps to Reproduce
CUJ2.md
Expected Behavior
I expect it flake less or have enough perpetual retry mechanisms to eventually succeed
Actual Behavior
It flaked and didn't succeed for a bit.
Environment
- AICR version (CLI
aicr version, API image tag, or commit SHA): - Install method (release binary / build from source / container image):
- Platform (eks/gke/aks/self-managed):
- Kubernetes version:
- OS (ubuntu/cos/other) + version:
- Kernel version:
- GPU type (h100/gb200/a100/l40/other):
- Workload intent (training/inference):
Command / Request Used
No response
Logs / Error Output
Additional Context
No response
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working