Fix flaky test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI #21133
TomeHirata merged 3 commits into master
…g request timeouts and fixing flush_trace_async_logging Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
test_distributed_tracing_e2e_nested_call ReadTimeout on slow CI
Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
@TomeHirata Where did you run into this issue?
@TomeHirata Why does it only fail on Windows?
I'm not sure if it's only happening on Windows, though. @copilot thoughts?
The failure is more likely to occur on Windows (though not exclusive to it) for a few reasons:

- Subprocess creation is slower on Windows than on Linux.
- SQLite I/O is slower on Windows CI runner disks.
- OS-level interference (e.g., Windows Defender inspecting localhost traffic) can add several seconds of network latency.

These factors compound across the 3-hop chain (test → server1 → server2 → server1 exports → returns to test), pushing total latency close to or over the 10s timeout. The same test passed on Linux runners in the same CI run because Linux is faster at subprocess creation and SQLite I/O. Raising the timeouts (30s for the test client, 20s for the server's internal request) gives enough headroom on Windows while keeping the cascade-safe invariant (the server times out before the client).
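The cascade-safe invariant described above can be sketched as a small model. This is illustrative only, not MLflow's actual test code; the constant values come from this PR, and the function name and overhead parameter are hypothetical:

```python
# Illustrative model of the staggered-timeout invariant: the server's
# internal timeout must fire before the client's, so the client always
# receives *some* response (a deterministic error) rather than a ReadTimeout.

CLIENT_TIMEOUT_S = 30.0   # test client -> server1 (value from this PR)
SERVER_TIMEOUT_S = 20.0   # server1 -> server2 internal hop (value from this PR)

def client_gets_response(hop_latency_s: float, server_overhead_s: float = 2.0) -> bool:
    """Return True if the test client receives a response before timing out.

    If the server1 -> server2 hop exceeds SERVER_TIMEOUT_S, server1 gives up
    at that point and returns an error right away, so the client's wait is
    capped at min(hop, SERVER_TIMEOUT_S) plus server1's own overhead.
    """
    wait_s = min(hop_latency_s, SERVER_TIMEOUT_S) + server_overhead_s
    return wait_s < CLIENT_TIMEOUT_S
```

Under this model, both the normal ~1.4s hop and the observed ~7s Windows spike stay well under the client timeout, and even a pathological hop is converted by server1 into an error response at ~22s, before the client's 30s deadline.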
**Investigation Results**

We instrumented the test with timing breakdowns (#21135) and ran it 100 times on Windows CI. Here are the findings:

**Normal case (~59/60 iterations)**

**Failing case (iteration #60)**

**Root cause**

The bottleneck is the HTTP round-trip between server1 and server2 on localhost. Server2 processing was normal (1.3s), but the round-trip ballooned from ~1.4s to ~7s, meaning ~5.7s was pure OS-level network latency. This is consistent with Windows Defender or other OS interference on the CI runner. Combined with the client-side overhead, the total exceeded the 10s timeout.

**Conclusion**

The timeout bump approach in this PR is correct: this is Windows CI runner variability, not an MLflow code issue. The staggered timeout values (server 20s < client 30s) are reasonable given these numbers.

Investigation PR: #21135

When fixing a flaky test, we should reproduce the failure first; otherwise, the fix may be incomplete.
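The per-hop timing breakdowns mentioned above could be collected with a small helper like the following. This is an illustrative sketch, not the actual instrumentation from #21135; the `timed` helper and label names are hypothetical:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict):
    """Record the wall-clock duration of a block under `label` (hypothetical helper)."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[label] = time.monotonic() - start

timings: dict[str, float] = {}
with timed("server1_to_server2_round_trip", timings):
    time.sleep(0.05)  # stand-in for the HTTP hop being measured
```

Wrapping each hop (client request, server1 → server2 call, span export) in such a block is what lets a failing iteration be attributed to one specific stage, as in the round-trip finding above.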
Pull request overview
This PR reduces flakiness in the distributed tracing end-to-end nested call test by increasing HTTP request timeouts and intentionally staggering client vs server timeouts to avoid simultaneous timeout cascades on slow CI (notably Windows).
Changes:
- Increase the client-side request timeout used by distributed tracing E2E tests to 30s.
- Increase the Flask fixture server’s internal nested-call timeout to 20s (kept lower than the client’s timeout).
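As background on the failure mode being fixed: a `ReadTimeout` means the client connected but gave up waiting for response bytes. A self-contained stdlib sketch of that behavior (illustrative only, not MLflow code; `socket.timeout` is the stdlib analogue of requests' `ReadTimeout`):

```python
import socket
import threading
import time

def silent_server(listener: socket.socket) -> None:
    """Accept a connection, hold it open without replying, then close it."""
    conn, _ = listener.accept()
    time.sleep(0.5)  # never send a response within the client's timeout
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=silent_server, args=(listener,), daemon=True).start()

# The client connects fine, but recv() times out because no bytes arrive.
client = socket.create_connection(("127.0.0.1", port), timeout=0.1)
try:
    client.recv(1024)
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    client.close()
```

In the flaky test, server1 played the role of the silent peer whenever its own downstream request was still blocked, which is why staggering the timeouts (so server1 answers with an error before the client gives up) removes the race.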
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/tracing/test_distributed.py` | Raises the test client request timeout to reduce ReadTimeout flakiness in E2E tracing tests. |
| `tests/tracing/fixtures/flask_tracing_server.py` | Raises the server's nested-call request timeout while keeping it below the client timeout to prevent cascading timeouts. |
@copilot Update the PR description. Strictly follow the PR template |
Updated the PR description to strictly follow the template. |
@copilot The "How should the PR be classified in the release notes? Choose one" section is missing in the PR description.
@harupy The model … You can try again without specifying a model (just …). If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: …
…low CI (mlflow#21133) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: TomeHirata <33407409+TomeHirata@users.noreply.github.com>
Related Issues/PRs
#21135
What changes are proposed in this pull request?
- `test_distributed_tracing_e2e_nested_call` intermittently fails on slow CI (particularly Windows) with `ReadTimeout`. Investigation (#21135) confirmed the root cause: the test client and server1 both used `REQUEST_TIMEOUT=10s`. On Windows CI, the HTTP round-trip between server1 and server2 on localhost can spike from ~1.4s to ~7s due to OS-level network latency (consistent with Windows Defender or other CI runner interference). Combined with span export overhead, the 3-hop chain (client → server1 → server2) exceeded the 10s timeout, causing both the server's internal request and the client's request to time out simultaneously, leaving the client with no response.
- Increase `REQUEST_TIMEOUT` in `tests/tracing/test_distributed.py` from 10s to 30s, giving the test client enough headroom to receive a response from server1 even under worst-case Windows CI latency spikes.
- Increase `REQUEST_TIMEOUT` in `tests/tracing/fixtures/flask_tracing_server.py` from 10s to 20s, keeping it below the client timeout (20s < 30s) to prevent a cascade: server1 returns a deterministic error to the client before the client's own timeout fires.

How is this PR tested?
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality

How should the PR be classified in the release notes? Choose one:

- `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section

Should this PR be included in the next patch release?