
Fix flaky test: Aspire.Hosting.OpenAI.Tests.OpenAIFunctionalTests.DependentResourceWaitsForOpenAIModelResourceWithHealthCheckToBeHealthy#14928

Open
radical wants to merge 1 commit into main from fix-openai-flaky

Conversation

@radical
Member

@radical radical commented Mar 4, 2026

Flaky Test Fix

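Test

Aspire.Hosting.OpenAI.Tests.OpenAIFunctionalTests.DependentResourceWaitsForOpenAIModelResourceWithHealthCheckToBeHealthy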

Root Cause

The OpenAIModelResource implements IResourceWithParent<OpenAIResource>, so ResourceHealthCheckService discovers health checks from both the model and its parent via TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(). The parent's resource_check health check makes an external HTTP call to https://status.openai.com/api/v2/status.json, which can time out or fail in CI. When it does, the model resource never reaches healthy status.
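
The inheritance described above can be illustrated with a minimal, self-contained sketch. The types below (`Resource`, `HealthCheck`, `HealthCheckDiscovery`) are simplified stand-ins invented for this example, not the Aspire types, and placing `blocking_check` on the model is an assumption about how the test registers it. The point is only that a lookup which walks up the parent chain also returns the parent's `resource_check`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins for illustration only; these are NOT the Aspire types.
public sealed record HealthCheck(string Key);

public class Resource
{
    public Resource(string name, Resource? parent = null)
    {
        Name = name;
        Parent = parent;
    }

    public string Name { get; }
    public Resource? Parent { get; }
    public List<HealthCheck> HealthChecks { get; } = new();
}

public static class HealthCheckDiscovery
{
    // Collects health check keys from the resource and every ancestor,
    // mimicking an "including ancestors" annotation lookup.
    public static IReadOnlyList<string> KeysIncludingAncestors(Resource resource)
    {
        var keys = new List<string>();
        for (Resource? current = resource; current is not null; current = current.Parent)
        {
            keys.AddRange(current.HealthChecks.Select(h => h.Key));
        }
        return keys;
    }
}

public static class Demo
{
    public static IReadOnlyList<string> Run()
    {
        // Parent carries the status page check (an external HTTP call in the real resource).
        var parent = new Resource("openai");
        parent.HealthChecks.Add(new HealthCheck("resource_check"));

        // Assumption: the test's own check is registered on the child model.
        var model = new Resource("model", parent);
        model.HealthChecks.Add(new HealthCheck("blocking_check"));

        // The child's effective set contains its own check plus the parent's.
        return HealthCheckDiscovery.KeysIncludingAncestors(model);
    }
}
```

In this model the child is only healthy once every returned check passes, so one slow external call on the parent is enough to stall the child.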

Fix

Remove the parent's resource_check health check annotation in the test before building the app, matching the pattern already used by the non-quarantined sibling test (DependentResourceWaitsForOpenAIResourceWithHealthCheckToBeHealthy).

Verification

| Phase | Run | Config | Result |
| --- | --- | --- | --- |
| Verification (post-fix) | 22652359417 | 5 runners × 10 iterations × ubuntu-latest | All 50 passed |

Local runs:

  • Post-fix: 10/10 passed on macOS (Darwin/arm64)

Verification Rationale

High confidence — root cause is a clear pattern match (external HTTP calls in health check inherited by child resource). The sibling test already demonstrates the correct fix pattern. CI verification confirms 50/50 pass on Linux, the most affected OS (14% failure rate pre-fix).
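
As a rough sanity check on the 50/50 result: if the runs are independent and the 14% pre-fix failure rate applied to each run (both are assumptions, not measurements from this PR), the chance of seeing 50 consecutive passes without any fix would be approximately:

```latex
P(\text{50 consecutive passes}) = (1 - 0.14)^{50} = 0.86^{50} \approx 5.3 \times 10^{-4}
```

Under those assumptions, a clean run of 50 would be very unlikely to occur by chance alone.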

Notes

  • [QuarantinedTest] attribute kept — unquarantining happens separately after 21 days of zero failures

Note: This PR intentionally does not close #10977. The test will remain quarantined until stability is confirmed.


This fix was generated using the fix-flaky-test skill.

Copilot AI review requested due to automatic review settings March 4, 2026 02:37
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 14928

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 14928"

Contributor

Copilot AI left a comment


Pull request overview

Fixes flakiness in OpenAIFunctionalTests.DependentResourceWaitsForOpenAIModelResourceWithHealthCheckToBeHealthy by preventing the model resource from inheriting the parent OpenAI resource’s default health check that triggers external HTTP calls during test execution.

Changes:

  • Refactors the test to keep a handle to the parent OpenAI resource builder.
  • Removes the parent OpenAI resource’s default "resource_check" health check annotation before building the app to avoid external calls to status.openai.com.

@radical radical changed the title from "Fix flaky test: remove inherited parent health check making external HTTP calls" to "Fix flaky test: Aspire.Hosting.OpenAI.Tests.OpenAIFunctionalTests.DependentResourceWaitsForOpenAIModelResourceWithHealthCheckToBeHealthy" Mar 4, 2026
Comment on lines +34 to +37
// Remove the default status page health check from the parent OpenAI resource
// to avoid external HTTP calls to status.openai.com during tests.
var statusPageHealthCheck = Assert.Single(openai.Resource.Annotations, x => x is HealthCheckAnnotation hca && hca.Key == "resource_check");
openai.Resource.Annotations.Remove(statusPageHealthCheck);

Member

How about removing everything other than blocking_check? That's the only one this test cares about.
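
One way to read this suggestion is as a filter-then-remove pass over the annotations. The sketch below uses hypothetical stand-in types (`IAnnotation`, `HealthCheckAnnotationStub`, `OtherAnnotationStub`) rather than the Aspire ones, so it shows only the shape of the idea and is not the code in this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration; not the Aspire annotation types.
public interface IAnnotation { }
public sealed record HealthCheckAnnotationStub(string Key) : IAnnotation;
public sealed record OtherAnnotationStub(string Name) : IAnnotation;

public static class AnnotationFilter
{
    // Removes every health check annotation whose key differs from keepKey,
    // leaving non-health-check annotations untouched. Returns the number removed.
    public static int KeepOnlyHealthCheck(ICollection<IAnnotation> annotations, string keepKey)
    {
        // Materialize first so the collection is not modified while being enumerated.
        var toRemove = annotations
            .OfType<HealthCheckAnnotationStub>()
            .Where(h => h.Key != keepKey)
            .ToList();

        foreach (var annotation in toRemove)
        {
            annotations.Remove(annotation);
        }

        return toRemove.Count;
    }
}
```

Assuming the real annotation collection behaves like an ordinary mutable collection, the same pattern could presumably be applied to `openai.Resource.Annotations` with `HealthCheckAnnotation` and the key `"blocking_check"`. Unlike the `Assert.Single` approach in the diff, this version would not fail if the default check were renamed or absent.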

Member

What about DependentResourceWaitsForOpenAIResourceWithHealthCheckToBeHealthy?

Member Author

@copilot respond to the feedback, and handle that.

Member

> How about removing everything other than blocking_check? That's the only one this test cares about.

Is this statement even true?

If the idea of the tests is to run OAI health checks, and the OAI health checks then ping the OAI servers, then these tests are inherently flaky: they depend on network access to OAI and on OAI being up.

Member Author

…HTTP calls

The OpenAI model resource inherits the parent resource's health check via
TryGetAnnotationsIncludingAncestorsOfType. The parent's resource_check calls
https://status.openai.com/api/v2/status.json which can timeout in CI,
preventing the model from reaching healthy status.

Remove the parent's status page health check in the test, matching the
pattern already used by the non-quarantined sibling test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@radical radical force-pushed the fix-openai-flaky branch from 826e6dc to 7a8a73c Compare March 4, 2026 02:49
Contributor

Copilot AI commented Mar 4, 2026

@radical I've opened a new pull request, #14929, to work on those changes. Once the pull request is ready, I'll request review from you.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🎬 CLI E2E Test Recordings

The following terminal recordings are available for commit 7a8a73c:

Test Recording
AddPackageInteractiveWhileAppHostRunningDetached ▶️ View Recording
AddPackageWhileAppHostRunningDetached ▶️ View Recording
AgentCommands_AllHelpOutputs_AreCorrect ▶️ View Recording
AgentInitCommand_MigratesDeprecatedConfig ▶️ View Recording
AgentInitCommand_WithMalformedMcpJson_ShowsErrorAndExitsNonZero ▶️ View Recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps ▶️ View Recording
Banner_DisplayedOnFirstRun ▶️ View Recording
Banner_DisplayedWithExplicitFlag ▶️ View Recording
CreateAndDeployToDockerCompose ❌ Upload failed
CreateAndDeployToDockerComposeInteractive ▶️ View Recording
CreateAndPublishToKubernetes ▶️ View Recording
CreateAndRunAspireStarterProject ▶️ View Recording
CreateAndRunAspireStarterProjectWithBundle ▶️ View Recording
CreateAndRunJsReactProject ▶️ View Recording
CreateAndRunPythonReactProject ▶️ View Recording
CreateAndRunTypeScriptStarterProject ▶️ View Recording
CreateEmptyAppHostProject ▶️ View Recording
CreateStartAndStopAspireProject ▶️ View Recording
CreateStartWaitAndStopAspireProject ▶️ View Recording
CreateTypeScriptAppHostWithViteApp ▶️ View Recording
DescribeCommandResolvesReplicaNames ▶️ View Recording
DescribeCommandShowsRunningResources ▶️ View Recording
DetachFormatJsonProducesValidJson ▶️ View Recording
DoctorCommand_DetectsDeprecatedAgentConfig ▶️ View Recording
DoctorCommand_WithSslCertDir_ShowsTrusted ▶️ View Recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted ▶️ View Recording
LogsCommandShowsResourceLogs ▶️ View Recording
PsCommandListsRunningAppHost ▶️ View Recording
PsFormatJsonOutputsOnlyJsonToStdout ▶️ View Recording
SecretCrudOnDotNetAppHost ▶️ View Recording
SecretCrudOnTypeScriptAppHost ▶️ View Recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels ▶️ View Recording
StopAllAppHostsFromAppHostDirectory ▶️ View Recording
StopAllAppHostsFromUnrelatedDirectory ▶️ View Recording
StopNonInteractiveMultipleAppHostsShowsError ▶️ View Recording
StopNonInteractiveSingleAppHost ▶️ View Recording
StopWithNoRunningAppHostExitsSuccessfully ▶️ View Recording

📹 Recordings uploaded automatically from CI run #22652814908


Development

Successfully merging this pull request may close these issues.

[Failing test]: DependentResourceWaitsForOpenAIModelResourceWithHealthCheckToBeHealthy

4 participants