E2E: Fix flaky Helm installation failures due to "cannot re-use a nam…#284

Merged
sanjaychatterjee merged 1 commit into ai-dynamo:main from shmuel-runai:RUN-34234/main-e2e-cluster-replace
Dec 5, 2025

Conversation

@shmuel-runai

Contributor

What type of PR is this?
/kind e2e
/kind bug

What this PR does / why we need it:
The E2E tests are flaky: they occasionally fail during cluster setup with the following errors:

❌ Kai Scheduler installation failed on attempt 1/3: helm install failed:
    the server could not find the requested resource

❌ Kai Scheduler installation failed on attempt 2/3: helm install failed:
    cannot re-use a name that is still in use

❌ Kai Scheduler installation failed on attempt 3/3: helm install failed:
    cannot re-use a name that is still in use
  • Root Cause
    When a Helm install fails partway through (e.g., due to a race condition where the Kubernetes API server isn't fully ready), Helm leaves a release record in a failed/pending state. The existing retry logic calls helm install again with the same release name, and Helm rejects the retry with "cannot re-use a name that is still in use", even though the release is in a failed state.

  • Solution
    Set Replace = true on the Helm install client configuration. This allows Helm to replace releases that are in a failed/pending state, so the existing retries can succeed.

  • Testing
    Ran the e2e tests multiple times to verify that the fix handles the race condition.
    The Replace flag only affects releases in failed/pending states, not successful releases.
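
The failure mode and the fix described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the Helm SDK and not the code changed in this PR: `fakeStore`, `install`, `installWithRetry`, and the release name `kai-scheduler` are hypothetical names introduced for illustration, and the model covers only the `failed` case. Which release states Helm's `Replace` option actually accepts depends on the Helm version and is not verified here.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified stand-in for a Helm release status (assumption).
type Status string

const (
	StatusDeployed Status = "deployed"
	StatusFailed   Status = "failed"
)

// Error values mirroring the messages seen in the e2e logs.
var (
	ErrNameInUse   = errors.New("cannot re-use a name that is still in use")
	ErrAPINotReady = errors.New("the server could not find the requested resource")
)

// fakeStore is a hypothetical in-memory release store: release name -> latest status.
type fakeStore map[string]Status

// install models a single install attempt. If apiReady is false, the attempt
// fails partway through and leaves a failed release record behind.
func install(store fakeStore, name string, replace, apiReady bool) error {
	if st, exists := store[name]; exists {
		// Without replace, any existing record blocks the name.
		// With replace, only a failed record may be overwritten.
		if !replace || st != StatusFailed {
			return ErrNameInUse
		}
	}
	if !apiReady {
		store[name] = StatusFailed
		return ErrAPINotReady
	}
	store[name] = StatusDeployed
	return nil
}

// installWithRetry retries up to attempts times with the same release name.
// The model assumes the API server becomes ready after the first attempt.
func installWithRetry(store fakeStore, name string, replace bool, attempts int) error {
	var err error
	for i := 1; i <= attempts; i++ {
		apiReady := i > 1
		if err = install(store, name, replace, apiReady); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", i, attempts, err)
	}
	return err
}

func main() {
	// Without replace, the failed record from attempt 1 blocks every retry.
	fmt.Println("replace=false:", installWithRetry(fakeStore{}, "kai-scheduler", false, 3))
	// With replace, attempt 2 overwrites the failed record and succeeds.
	fmt.Println("replace=true: ", installWithRetry(fakeStore{}, "kai-scheduler", true, 3))
}
```

In this model, a release that is already deployed still blocks the name when `replace` is true, which matches the note above that successful releases are unaffected.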

Does this PR introduce an API change?

NONE

@sanjaychatterjee merged commit d462e65 into ai-dynamo:main on Dec 5, 2025
4 of 5 checks passed