Fix infinite fetch loop in trace detail view when num_spans metadata mismatches #20596

Merged
daniellok-db merged 10 commits into mlflow:master from coldzero94:fix/infinite-trace-fetch-loop
Feb 20, 2026

Conversation

@coldzero94 coldzero94 (Contributor) commented Feb 5, 2026

Related Issues/PRs

Resolve #20595

What changes are proposed in this pull request?

Fix infinite fetch loop that occurs when viewing trace details if the mlflow.trace.sizeStats.num_spans metadata doesn't match the actual span count.

Problem: The polling logic in useGetTrace.tsx continued polling indefinitely after a trace reached the OK state, as long as the expected span count (from metadata) didn't match the actual span count. This caused:

  • Continuous network requests (2000+ requests observed)
  • High CPU usage in browser
  • Unresponsive UI

Root cause: The num_spans metadata can become inconsistent with actual data (e.g., metadata says 33 spans, but only 32 exist in DB). Since the frontend calls the API with allow_partial=true, the backend's span completeness check is bypassed entirely.

Solution: Add a bounded retry timeout for polling when the trace is in OK state but span count still doesn't match:

  • Continue polling for up to 60 attempts (~60s) to allow late-arriving child spans (covers OTLP BatchSpanProcessor worst case of 5s schedule delay + 30s export timeout)
  • Stop polling immediately on ERROR state
  • Reset the poll counter when navigating between traces so that attempts counted for one trace do not leak into the next
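
The three rules above can be sketched as a small decision function. This is a minimal illustration, not the actual code in useGetTrace.tsx: the names `BoundedTracePoller`, `TraceSnapshot`, `POLL_INTERVAL_MS`, and `MAX_OK_STATE_POLLS` are hypothetical, the 1-second interval is inferred from "60 attempts (~60s)", and stopping when the metadata is absent is an assumption.

```typescript
// Hypothetical sketch of bounded polling; not the actual useGetTrace.tsx code.
type TraceState = 'IN_PROGRESS' | 'OK' | 'ERROR';

interface TraceSnapshot {
  traceId: string;
  state: TraceState;
  // Value of the mlflow.trace.sizeStats.num_spans metadata, if present.
  expectedSpans: number | undefined;
  actualSpans: number;
}

const POLL_INTERVAL_MS = 1000; // assumed: roughly one attempt per second
const MAX_OK_STATE_POLLS = 60; // ~60s budget for late-arriving child spans

class BoundedTracePoller {
  private okPolls = 0;
  private currentTraceId: string | undefined;

  // Returns the delay before the next poll, or false to stop polling.
  nextInterval(trace: TraceSnapshot): number | false {
    // Reset the counter when the user navigates to a different trace.
    if (trace.traceId !== this.currentTraceId) {
      this.currentTraceId = trace.traceId;
      this.okPolls = 0;
    }
    // ERROR is terminal: stop immediately.
    if (trace.state === 'ERROR') return false;
    // Trace still running: keep polling without consuming the budget.
    if (trace.state !== 'OK') return POLL_INTERVAL_MS;
    // OK and span counts agree (or no metadata, by assumption): done.
    if (trace.expectedSpans === undefined || trace.actualSpans === trace.expectedSpans) {
      return false;
    }
    // OK but counts disagree: poll again only while the budget lasts.
    this.okPolls += 1;
    return this.okPolls <= MAX_OK_STATE_POLLS ? POLL_INTERVAL_MS : false;
  }
}

// A trace whose metadata claims 33 spans while only 32 exist.
const poller = new BoundedTracePoller();
const stuck: TraceSnapshot = { traceId: 'tr-1', state: 'OK', expectedSpans: 33, actualSpans: 32 };
let scheduled = 0;
while (poller.nextInterval(stuck) !== false) scheduled += 1;
console.log(scheduled); // 60
```

With the mismatched trace from the root-cause example, the loop schedules 60 further polls and then stops, instead of polling forever.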

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Added unit tests for:

  • Bounded polling stops after max retries on span count mismatch
  • No polling on ERROR state
  • Poll counter resets when switching between traces

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.
  • Yes. Please link the corresponding PR or explain how you plan to update it.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Fix infinite fetch loop in trace detail view when num_spans metadata mismatches actual span count.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

github-actions bot (Contributor) commented Feb 5, 2026

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20596/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20596/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20596/merge

github-actions bot (Contributor) commented Feb 5, 2026

@coldzero94 Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

@github-actions github-actions bot added the area/tracing, area/uiux, rn/bug-fix, and v3.9.1 labels Feb 5, 2026
@coldzero94 coldzero94 force-pushed the fix/infinite-trace-fetch-loop branch from 4f6b374 to 2af3cdd on February 5, 2026 11:26

// Note: Previously we compared span counts even for completed traces, but this caused
// infinite polling loops when num_spans metadata was inconsistent with actual span count.
// Since the trace is already complete, no more spans will arrive, so stop polling.
if (traceInfo?.state === 'ERROR' || traceInfo?.state === 'OK') {

Collaborator

I don't think we should rely on the 'OK' status: as the comments on lines 53-55 explain, the status can be updated while some internal spans are still uploading in parallel.
Instead, we could use the other option you mentioned in the issue and set a retry timeout when the status is 'OK', since in that case it shouldn't take long for all child spans to be logged.

@coldzero94 coldzero94 (Contributor, Author) Feb 6, 2026

Thanks for the feedback! You're right that we shouldn't immediately stop polling on OK state since child spans may still be uploading in parallel.

I've updated the approach to add a bounded retry timeout: when the trace is in OK state but span count still doesn't match, we continue polling for up to 60 attempts (~60s). This covers the worst-case OTLP BatchSpanProcessor delay (5s schedule delay + 30s export timeout = 35s). The poll counter is also reset when navigating between traces to prevent count leaking.

Note: The frontend currently calls the API with allow_partial=true, so the backend's span completeness check is bypassed entirely. A more robust long-term fix could be adding a spans_complete flag to the allow_partial=true response, so the frontend can use the backend's authoritative completeness check. But that would be a separate PR.

@serena-ruan serena-ruan self-assigned this Feb 6, 2026
…mismatches

The polling logic in useGetTrace continued indefinitely when the
num_spans metadata was inconsistent with the actual span count,
even after the trace reached a terminal state.

Add a bounded retry mechanism: after the trace reaches OK state,
continue polling for up to 30 attempts (~30s) waiting for remaining
child spans. If the count still doesn't match, stop polling to
prevent infinite loops from metadata inconsistencies.

This mirrors the backend's own retry approach in
sqlalchemy_store.get_trace() which also uses bounded retries.

Fixes mlflow#20595

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
@coldzero94 coldzero94 force-pushed the fix/infinite-trace-fetch-loop branch from 2af3cdd to ff8324a on February 6, 2026 07:00
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>

// Polling should eventually stop due to max retry count (not run forever)
// Wait a few seconds to verify polling occurs but is bounded
await new Promise((resolve) => setTimeout(resolve, 3000));

Collaborator

Could we use jest fake timers instead of real timers for testing purpose?

Contributor, Author

Done! Converted all tests to use Jest fake timers. All tests pass locally.
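
To illustrate why fake timers suit this kind of test, here is a minimal sketch of the idea using a hand-rolled virtual clock, so it runs without a test framework. It is not the PR's test code and does not use Jest's API: `FakeClock` and `startPolling` are hypothetical helpers, and the 1-second interval and 60-poll cap are taken from the PR description.

```typescript
// Hand-rolled virtual clock: a stand-in for fake timers (hypothetical helper).
class FakeClock {
  private now = 0;
  private nextId = 0;
  private timers: { id: number; at: number; fn: () => void }[] = [];

  setTimeout(fn: () => void, delayMs: number): number {
    const id = this.nextId++;
    this.timers.push({ id, at: this.now + delayMs, fn });
    return id;
  }

  // Advance virtual time, firing due timers in chronological order.
  advanceBy(ms: number): void {
    const target = this.now + ms;
    for (;;) {
      const due = this.timers
        .filter((t) => t.at <= target)
        .sort((a, b) => a.at - b.at || a.id - b.id)[0];
      if (!due) break;
      this.timers = this.timers.filter((t) => t.id !== due.id);
      this.now = due.at;
      due.fn(); // may schedule further timers, which the loop picks up
    }
    this.now = target;
  }
}

// Poller under test: re-schedules itself every second, at most maxPolls times.
function startPolling(clock: FakeClock, fetchTrace: () => void, maxPolls: number): void {
  let polls = 0;
  const tick = () => {
    fetchTrace();
    polls += 1;
    if (polls < maxPolls) clock.setTimeout(tick, 1000);
  };
  clock.setTimeout(tick, 1000);
}

let fetches = 0;
const clock = new FakeClock();
startPolling(clock, () => { fetches += 1; }, 60);

clock.advanceBy(3000); // three virtual seconds pass instantly
console.log(fetches); // 3

clock.advanceBy(600000); // ten more virtual minutes
console.log(fetches); // 60
```

Because time is virtual, the test can verify both that polling happens and that it stays bounded well past the 60-second budget, without waiting on a real clock.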

@serena-ruan serena-ruan (Collaborator) left a comment

LGTM! Thanks for the contribution :D Let's update tests and merge then!

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
@coldzero94 coldzero94 (Contributor, Author)

@serena-ruan Is it done? Can you check it?

@daniellok-db daniellok-db (Collaborator) left a comment

lgtm

@daniellok-db daniellok-db added this pull request to the merge queue Feb 20, 2026
Merged via the queue into mlflow:master with commit 1d3c6b0 Feb 20, 2026
26 checks passed
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Feb 20, 2026
…mismatches (mlflow#20596)

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Co-authored-by: Serena Ruan <82044803+serena-ruan@users.noreply.github.com>
daniellok-db pushed a commit that referenced this pull request Feb 20, 2026
…mismatches (#20596)

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Co-authored-by: Serena Ruan <82044803+serena-ruan@users.noreply.github.com>
@coldzero94 coldzero94 deleted the fix/infinite-trace-fetch-loop branch February 20, 2026 12:06

Labels

  • area/tracing: MLflow Tracing and its integrations
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • rn/bug-fix: Mention under Bug Fixes in Changelogs.
  • size/M
  • v3.10.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Infinite fetch loop in trace detail view when num_spans metadata mismatches actual span count

3 participants