Skip to content

[Bug]: deploy.sh is flakey with context deadline exceeded on CRD application #335

@njtran

Description

@njtran

Prerequisites

  • I searched existing issues and found no duplicates
  • I can reproduce this issue consistently
  • This is not a security vulnerability (use Security Advisories instead)

Bug Description

When running deploy.sh after having generated a bundle from a recipe (specifically CUJ2), I had some issues on deploying:

  • first deploy failed on a context deadline exceeded on CRD application
  • second deploy attempted to do it idempotently, but ran into an issue with the webhook configurations that is being fixed
  • third deploy failed but was prefaced by an undeploy.sh but the undeploy left some cert-manager CRDs around due to some resource policies
  • fourth deploy failed prefaced by an undeploy.sh that succeeded
  • fifth deploy succeeded

Would be nice if we had better retry mechanisms on this as the recovery is to just undeploy and deploy which is not ideal

Impact

Low (minor issue)

Component

Other / Unknown

Regression?

Yes, this worked before (please specify version below)

Steps to Reproduce

CUJ2.md

Expected Behavior

I expect it flake less or have enough perpetual retry mechanisms to eventually succeed

Actual Behavior

It flaked and didn't succeed for a bit.

Environment

  • AICR version (CLI aicr version, API image tag, or commit SHA):
  • Install method (release binary / build from source / container image):
  • Platform (eks/gke/aks/self-managed):
  • Kubernetes version:
  • OS (ubuntu/cos/other) + version:
  • Kernel version:
  • GPU type (h100/gb200/a100/l40/other):
  • Workload intent (training/inference):

Command / Request Used

No response

Logs / Error Output

Additional Context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions