
Fix flaky test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI#21133

Merged
TomeHirata merged 3 commits into master from copilot/fix-e2e-test-timeout-issue on Feb 26, 2026

Conversation

Contributor

Copilot AI commented Feb 25, 2026

Related Issues/PRs

#21135

What changes are proposed in this pull request?

test_distributed_tracing_e2e_nested_call intermittently fails on slow CI (particularly Windows) with ReadTimeout. Investigation (#21135) confirmed the root cause: the test client and server1 both used REQUEST_TIMEOUT=10s. On Windows CI, the HTTP round-trip between server1 and server2 on localhost can spike from ~1.4s to ~7s due to OS-level network latency (consistent with Windows Defender or other CI runner interference). Combined with span export overhead, the 3-hop chain (client → server1 → server2) exceeded the 10s timeout, causing both the server's internal request and the client's request to time out simultaneously — leaving the client with no response.

  • Increase REQUEST_TIMEOUT in tests/tracing/test_distributed.py from 10s to 30s, giving the test client enough headroom to receive a response from server1 even under worst-case Windows CI latency spikes
  • Increase REQUEST_TIMEOUT in tests/tracing/fixtures/flask_tracing_server.py from 10s to 20s, keeping it below the client timeout (20s < 30s) to prevent cascade — server1 returns a deterministic error to the client before the client's own timeout fires
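The staggered-timeout scheme above can be sketched as a small model (hypothetical illustration, not code from the MLflow test suite; `client_outcome` and its overhead parameter are made up to show the invariant):

```python
# Hypothetical sketch of the staggered-timeout scheme described above.
# The constants mirror the PR's values; client_outcome is an illustrative
# model, not code from the MLflow test suite.
CLIENT_REQUEST_TIMEOUT = 30  # tests/tracing/test_distributed.py
SERVER_REQUEST_TIMEOUT = 20  # tests/tracing/fixtures/flask_tracing_server.py

def client_outcome(nested_latency_s, client_overhead_s=3.0):
    """Model what the test client observes for a given server1 -> server2 latency."""
    if nested_latency_s > SERVER_REQUEST_TIMEOUT:
        # server1's internal request times out first, so it can still
        # return a deterministic error inside the client's 30s budget.
        return "server_error"
    total = nested_latency_s + client_overhead_s
    return "ok" if total <= CLIENT_REQUEST_TIMEOUT else "client_timeout"

# Worst case observed on Windows CI (~7s nested call) now passes:
print(client_outcome(7.0))   # -> ok
# With 20s < 30s the server always fails first, never both at once:
print(client_outcome(25.0))  # -> server_error
```

The key design choice is the gap between the two values: as long as the server's timeout is strictly below the client's, a slow nested call produces a deterministic server-side error rather than two simultaneous timeouts.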

How is this PR tested?

  • Existing unit/integration tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

…g request timeouts and fixing flush_trace_async_logging

Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix ReadTimeout error in E2E distributed tracing test" to "Fix flaky test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI" on Feb 25, 2026
Copilot AI requested a review from TomeHirata February 25, 2026 11:09
Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
@harupy
Member

harupy commented Feb 25, 2026

@TomeHirata Where did you run into this issue?

@TomeHirata
Collaborator

@harupy
Member

harupy commented Feb 25, 2026

@TomeHirata Why does it only fail on windows?

@TomeHirata
Collaborator

I'm not sure if it's only happening for Windows, though. @copilot thoughts?

Contributor Author

Copilot AI commented Feb 25, 2026

I'm not sure if it's only happening for Windows, though. @copilot thoughts?

The failure is more likely to occur on Windows (though not exclusive to it) for a few reasons:

  1. Process spawn overhead: Windows cannot fork() — it must spawn() subprocesses, which requires re-importing all Python modules (Flask, MLflow, etc.) from scratch. Starting each Flask server in subprocess.Popen(...) is significantly slower on Windows than Linux.

  2. SQLite I/O: The synchronous span export in the request chain writes to SQLite. Windows has stricter file locking and higher I/O overhead, making this step slower.

  3. Localhost networking latency: Windows Defender/Firewall can add latency even for 127.0.0.1 connections.

These factors compound across the 3-hop chain (test → server1 → server2 → server1 exports → returns to test), pushing total latency close to or over the 10s timeout. The same test passed on Linux runners in the same CI run because Linux is faster at subprocess creation and SQLite I/O.

Raising the timeouts (30s for the test client, 20s for the server's internal request) gives enough headroom on Windows while keeping the cascade-safe invariant (server times out before client).
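Point 1 can be checked directly from Python: the `multiprocessing` module exposes the available process start methods, and on Windows only "spawn" exists (a minimal illustration, not part of this PR):

```python
import multiprocessing as mp
import sys

# "fork" clones the already-initialized interpreter (cheap; the Linux
# default), while "spawn" starts a fresh interpreter and re-imports every
# module (costly, and the only option on Windows).
available = mp.get_all_start_methods()

if sys.platform == "win32":
    # Windows cannot fork: spawn is the only supported start method.
    assert available == ["spawn"]
```

This is why each `subprocess.Popen(...)`-launched Flask server pays a full re-import of Flask and MLflow on Windows, inflating the server-startup phase measured in the investigation.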

@harupy
Member

harupy commented Feb 25, 2026

Investigation Results

We instrumented the test with timing breakdowns (#21135) and ran it 100 times on Windows CI. Here are the findings:

Normal case (~59/60 iterations)

Phase                                   Duration
Server startup                          ~2.4s
set_context                             ~0.001s
start_span                              ~1.15s
Nested HTTP call (server1 → server2)    ~1.4s
Client total round-trip                 ~2.8s

Failing case (iteration #60)

Phase                                   Duration
start_span (server1)                    1.511s
start_span (server2)                    1.309s
Nested HTTP call (server1 → server2)    6.992s
/handle1 total                          8.505s

Root cause

The bottleneck is the HTTP round-trip between server1 and server2 on localhost. Server2 processing was normal (1.3s), but the round-trip ballooned from ~1.4s to ~7s — meaning ~5.7s was pure OS-level network latency. This is consistent with Windows Defender or other OS interference on the CI runner.

Combined with the client-side overhead, the total exceeded the 10s REQUEST_TIMEOUT, causing the ReadTimeout.
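The per-phase timings above can be collected with a small helper like this (a hypothetical sketch; the actual instrumentation lives in #21135):

```python
import time
from contextlib import contextmanager

# Collected phase durations, keyed by phase name.
timings = {}

@contextmanager
def timed(phase):
    # Record the wall-clock duration of the wrapped block under its name.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start

# Usage, standing in for the real request handler:
with timed("nested_http_call"):
    time.sleep(0.01)  # placeholder for the server1 -> server2 round-trip
```

Wrapping each phase (server startup, `set_context`, `start_span`, the nested call) this way yields exactly the kind of breakdown tabulated above.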

Conclusion

The timeout bump approach in this PR is correct — this is Windows CI runner variability, not an MLflow code issue. The staggered timeout values (server 20s < client 30s) are reasonable given these numbers.


Investigation PR: #21135


When fixing a flaky test, we should reproduce the failure first; otherwise, the fix may be incomplete.

@harupy harupy marked this pull request as ready for review February 25, 2026 13:16
Copilot AI review requested due to automatic review settings February 25, 2026 13:16
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces flakiness in the distributed tracing end-to-end nested call test by increasing HTTP request timeouts and intentionally staggering client vs server timeouts to avoid simultaneous timeout cascades on slow CI (notably Windows).

Changes:

  • Increase the client-side request timeout used by distributed tracing E2E tests to 30s.
  • Increase the Flask fixture server’s internal nested-call timeout to 20s (kept lower than the client’s timeout).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
tests/tracing/test_distributed.py Raises the test client request timeout to reduce ReadTimeout flakiness in E2E tracing tests.
tests/tracing/fixtures/flask_tracing_server.py Raises the server’s nested-call request timeout while keeping it below the client timeout to prevent cascading timeouts.

@harupy
Member

harupy commented Feb 25, 2026

@copilot Update the PR description. Strictly follow the PR template

Contributor Author

Copilot AI commented Feb 25, 2026

@copilot Update the PR description. Strictly follow the PR template

Updated the PR description to strictly follow the template.

Copilot AI requested a review from harupy February 25, 2026 13:31
@harupy
Member

harupy commented Feb 25, 2026

@copilot The "How should the PR be classified in the release notes? Choose one" section is missing in the PR description.

Contributor Author

Copilot AI commented Feb 25, 2026

@harupy The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: e707690f-9274-4b9e-b573-b8ed32cca0ac

@github-actions github-actions bot added v3.10.1 area/tracing MLflow Tracing and its integrations rn/none List under Small Changes in Changelogs. labels Feb 25, 2026
Collaborator

@TomeHirata TomeHirata left a comment


LGTM

@TomeHirata TomeHirata added this pull request to the merge queue Feb 26, 2026
Merged via the queue into master with commit e190c79 Feb 26, 2026
101 of 106 checks passed
@TomeHirata TomeHirata deleted the copilot/fix-e2e-test-timeout-issue branch February 26, 2026 00:44
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Mar 5, 2026
…low CI (mlflow#21133)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
daniellok-db pushed a commit that referenced this pull request Mar 5, 2026
…low CI (#21133)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>