Fix infinite fetch loop in trace detail view when num_spans metadata mismatches #20596
Conversation
@coldzero94 Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template: This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.
Force-pushed 4f6b374 to 2af3cdd
Documentation preview for 11e200e is available.
// Note: Previously we compared span counts even for completed traces, but this caused
// infinite polling loops when num_spans metadata was inconsistent with actual span count.
// Since the trace is already complete, no more spans will arrive, so stop polling.
if (traceInfo?.state === 'ERROR' || traceInfo?.state === 'OK') {
I think we shouldn't rely on the 'OK' status: as explained in the line 53-55 comments, the status can be updated while some internal spans are still uploading in parallel.
I think we could use the other option you mentioned in the issue, setting a retry timeout if the status is 'OK', since in that case it shouldn't take long for all child spans to be logged.
Thanks for the feedback! You're right that we shouldn't immediately stop polling on OK state since child spans may still be uploading in parallel.
I've updated the approach to add a bounded retry timeout: when the trace is in OK state but the span count still doesn't match, we continue polling for up to 60 attempts (~60s). This covers the worst-case OTLP BatchSpanProcessor delay (5s schedule delay + 30s export timeout = 35s). The poll counter is also reset when navigating between traces to prevent the count from leaking across traces.
Note: The frontend currently calls the API with allow_partial=true, so the backend's span completeness check is bypassed entirely. A more robust long-term fix could be adding a spans_complete flag to the allow_partial=true response, so the frontend can use the backend's authoritative completeness check. But that would be a separate PR.
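The stopping rule discussed in this thread can be sketched as a small predicate. This is an illustrative sketch only: the names `shouldPoll`, `TraceState`, and `MAX_OK_POLL_ATTEMPTS` are hypothetical and do not come from `useGetTrace.tsx`.

```typescript
type TraceState = 'IN_PROGRESS' | 'OK' | 'ERROR';

// Illustrative retry budget; the PR discusses ~60 attempts at a 1s interval.
const MAX_OK_POLL_ATTEMPTS = 60;

function shouldPoll(
  state: TraceState | undefined,
  expectedSpans: number,
  actualSpans: number,
  okPollAttempts: number,
): boolean {
  // Terminal error: no more spans will ever arrive, so stop immediately.
  if (state === 'ERROR') return false;
  // All expected spans are present: done.
  if (expectedSpans === actualSpans) return false;
  // Trace completed, but child spans may still be uploading in parallel:
  // keep polling for a bounded number of attempts, then give up so that
  // inconsistent num_spans metadata cannot cause an infinite loop.
  if (state === 'OK') return okPollAttempts < MAX_OK_POLL_ATTEMPTS;
  // Trace still in progress: keep polling.
  return true;
}
```

The key design point is that the `OK` branch is the only one that consults the attempt counter, so a metadata mismatch can delay termination but never prevent it.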
…mismatches

The polling logic in useGetTrace continued indefinitely when the num_spans metadata was inconsistent with the actual span count, even after the trace reached a terminal state.

Add a bounded retry mechanism: after the trace reaches OK state, continue polling for up to 30 attempts (~30s) waiting for remaining child spans. If the count still doesn't match, stop polling to prevent infinite loops from metadata inconsistencies. This mirrors the backend's own retry approach in sqlalchemy_store.get_trace(), which also uses bounded retries.

Fixes mlflow#20595

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Force-pushed 2af3cdd to ff8324a
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
// Polling should eventually stop due to max retry count (not run forever)
// Wait a few seconds to verify polling occurs but is bounded
await new Promise((resolve) => setTimeout(resolve, 3000));
Could we use Jest fake timers instead of real timers for testing purposes?
Done! Converted all tests to use Jest fake timers. All tests pass locally.
serena-ruan left a comment:

LGTM! Thanks for the contribution :D Let's update tests and merge then!
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
@serena-ruan Is it done? Can you check it?
…mismatches (mlflow#20596)

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Co-authored-by: Serena Ruan <82044803+serena-ruan@users.noreply.github.com>
Related Issues/PRs
Resolves #20595
What changes are proposed in this pull request?
Fix an infinite fetch loop that occurs when viewing trace details if the `mlflow.trace.sizeStats.num_spans` metadata doesn't match the actual span count.

Problem: The polling logic in `useGetTrace.tsx` continued polling indefinitely after a trace reached the `OK` state, as long as the expected span count (from metadata) didn't match the actual span count, so the trace detail view kept refetching forever.

Root cause: The `num_spans` metadata can become inconsistent with the actual data (e.g., metadata says 33 spans, but only 32 exist in the DB). Since the frontend calls the API with `allow_partial=true`, the backend's span completeness check is bypassed entirely.

Solution: Add a bounded retry timeout for polling when the trace is in the `OK` state but the span count still doesn't match, and stop polling immediately in the `ERROR` state.

How is this PR tested?
Added unit tests for the new bounded polling behavior (using Jest fake timers).
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
Fix infinite fetch loop in trace detail view when `num_spans` metadata mismatches actual span count.

What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.