Fix flaky test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI #21133
TomeHirata merged 3 commits into master
…g request timeouts and fixing flush_trace_async_logging Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI
Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
@TomeHirata Where did you run into this issue?
@TomeHirata Why does it only fail on Windows?
I'm not sure if it's only happening on Windows, though. @copilot thoughts?
The failure is more likely to occur on Windows (though not exclusive to it) for a few reasons:

- Subprocess creation is slower on Windows than on Linux.
- SQLite I/O is slower on Windows CI runner disks.
- OS-level interference (e.g., Windows Defender inspecting localhost traffic) can add several seconds of network latency.

These factors compound across the 3-hop chain (test → server1 → server2 → server1 exports → returns to test), pushing total latency close to or over the 10s timeout. The same test passed on Linux runners in the same CI run because Linux is faster at subprocess creation and SQLite I/O. Raising the timeouts (30s for the test client, 20s for the server's internal request) gives enough headroom on Windows while keeping the cascade-safe invariant (the server times out before the client).
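The cascade-safe invariant described above can be sketched as a small model. This is illustrative only, not MLflow's actual test code; the constant values come from this PR, and the function name and overhead parameter are hypothetical:

```python
# Illustrative model of the staggered-timeout invariant: the server's
# internal timeout must fire before the client's, so the client always
# receives *some* response (a deterministic error) rather than a ReadTimeout.

CLIENT_TIMEOUT_S = 30.0   # test client -> server1 (value from this PR)
SERVER_TIMEOUT_S = 20.0   # server1 -> server2 internal hop (value from this PR)

def client_gets_response(hop_latency_s: float, server_overhead_s: float = 2.0) -> bool:
    """Return True if the test client receives a response before timing out.

    If the server1 -> server2 hop exceeds SERVER_TIMEOUT_S, server1 gives up
    at that point and returns an error right away, so the client's wait is
    capped at min(hop, SERVER_TIMEOUT_S) plus server1's own overhead.
    """
    wait_s = min(hop_latency_s, SERVER_TIMEOUT_S) + server_overhead_s
    return wait_s < CLIENT_TIMEOUT_S
```

Under this model, both the normal ~1.4s hop and the observed ~7s Windows spike stay well under the client timeout, and even a pathological hop is converted by server1 into an error response at ~22s, before the client's 30s deadline.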
**Investigation Results**

We instrumented the test with timing breakdowns (#21135) and ran it 100 times on Windows CI. Here are the findings:

**Normal case (~59/60 iterations)**

**Failing case (iteration #60)**

**Root cause**

The bottleneck is the HTTP round-trip between server1 and server2 on localhost. Server2 processing was normal (1.3s), but the round-trip ballooned from ~1.4s to ~7s, meaning ~5.7s was pure OS-level network latency. This is consistent with Windows Defender or other OS interference on the CI runner. Combined with the client-side overhead, the total exceeded the 10s timeout.

**Conclusion**

The timeout bump approach in this PR is correct: this is Windows CI runner variability, not an MLflow code issue. The staggered timeout values (server 20s < client 30s) are reasonable given these numbers.

Investigation PR: #21135

When fixing a flaky test, we should reproduce the failure first; otherwise, the fix may be incomplete.
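The per-hop timing breakdowns mentioned above could be collected with a small helper like the following. This is an illustrative sketch, not the actual instrumentation from #21135; the `timed` helper and label names are hypothetical:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict):
    """Record the wall-clock duration of a block under `label` (hypothetical helper)."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[label] = time.monotonic() - start

timings: dict[str, float] = {}
with timed("server1_to_server2_round_trip", timings):
    time.sleep(0.05)  # stand-in for the HTTP hop being measured
```

Wrapping each hop (client request, server1 → server2 call, span export) in such a block is what lets a failing iteration be attributed to one specific stage, as in the round-trip finding above.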
Pull request overview
This PR reduces flakiness in the distributed tracing end-to-end nested call test by increasing HTTP request timeouts and intentionally staggering client vs server timeouts to avoid simultaneous timeout cascades on slow CI (notably Windows).
Changes:
- Increase the client-side request timeout used by distributed tracing E2E tests to 30s.
- Increase the Flask fixture server’s internal nested-call timeout to 20s (kept lower than the client’s timeout).
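As background on the failure mode being fixed: a `ReadTimeout` means the client connected but gave up waiting for response bytes. A self-contained stdlib sketch of that behavior (illustrative only, not MLflow code; `socket.timeout` is the stdlib analogue of requests' `ReadTimeout`):

```python
import socket
import threading
import time

def silent_server(listener: socket.socket) -> None:
    """Accept a connection, hold it open without replying, then close it."""
    conn, _ = listener.accept()
    time.sleep(0.5)  # never send a response within the client's timeout
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=silent_server, args=(listener,), daemon=True).start()

# The client connects fine, but recv() times out because no bytes arrive.
client = socket.create_connection(("127.0.0.1", port), timeout=0.1)
try:
    client.recv(1024)
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    client.close()
```

In the flaky test, server1 played the role of the silent peer whenever its own downstream request was still blocked, which is why staggering the timeouts (so server1 answers with an error before the client gives up) removes the race.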
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/tracing/test_distributed.py` | Raises the test client request timeout to reduce ReadTimeout flakiness in E2E tracing tests. |
| `tests/tracing/fixtures/flask_tracing_server.py` | Raises the server's nested-call request timeout while keeping it below the client timeout to prevent cascading timeouts. |
@copilot Update the PR description. Strictly follow the PR template |
Updated the PR description to strictly follow the template. |
@copilot The "How should the PR be classified in the release notes? Choose one" section is missing in the PR description.
@harupy The model … You can try again without specifying a model (just …). If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: …
…low CI (mlflow#21133) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
Related Issues/PRs
#21135
What changes are proposed in this pull request?
- `test_distributed_tracing_e2e_nested_call` intermittently fails on slow CI (particularly Windows) with `ReadTimeout`. Investigation (#21135) confirmed the root cause: the test client and server1 both used `REQUEST_TIMEOUT=10s`. On Windows CI, the HTTP round-trip between server1 and server2 on localhost can spike from ~1.4s to ~7s due to OS-level network latency (consistent with Windows Defender or other CI runner interference). Combined with span export overhead, the 3-hop chain (client → server1 → server2) exceeded the 10s timeout, causing both the server's internal request and the client's request to time out simultaneously, leaving the client with no response.
- Increase `REQUEST_TIMEOUT` in `tests/tracing/test_distributed.py` from 10s to 30s, giving the test client enough headroom to receive a response from server1 even under worst-case Windows CI latency spikes.
- Increase `REQUEST_TIMEOUT` in `tests/tracing/fixtures/flask_tracing_server.py` from 10s to 20s, keeping it below the client timeout (20s < 30s) to prevent a cascade: server1 returns a deterministic error to the client before the client's own timeout fires.

How is this PR tested?
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality

How should the PR be classified in the release notes? Choose one:

- `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section

Should this PR be included in the next patch release?