Fix infinite fetch loop in trace detail view when num_spans metadata mismatches #20596
Conversation
@coldzero94 Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template: This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.
Force-pushed 4f6b374 to 2af3cdd
Documentation preview for 11e200e is available.
// Note: Previously we compared span counts even for completed traces, but this caused
// infinite polling loops when num_spans metadata was inconsistent with actual span count.
// Since the trace is already complete, no more spans will arrive, so stop polling.
if (traceInfo?.state === 'ERROR' || traceInfo?.state === 'OK') {
I think we shouldn't rely on the 'OK' status: as explained in the line 53-55 comments, the status can be updated while some internal spans are still uploading in parallel.
I think we could use the other option you mentioned in the issue, setting a retry timeout if the status is 'OK', since in that case it shouldn't take long for all child spans to be logged.
Thanks for the feedback! You're right that we shouldn't immediately stop polling on OK state since child spans may still be uploading in parallel.
I've updated the approach to add a bounded retry timeout: when the trace is in OK state but the span count still doesn't match, we continue polling for up to 60 attempts (~60s). This covers the worst-case OTLP BatchSpanProcessor delay (5s schedule delay + 30s export timeout = 35s). The poll counter is also reset when navigating between traces to prevent the count from leaking across traces.
Note: The frontend currently calls the API with allow_partial=true, so the backend's span completeness check is bypassed entirely. A more robust long-term fix could be adding a spans_complete flag to the allow_partial=true response, so the frontend can use the backend's authoritative completeness check. But that would be a separate PR.
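The stopping rule discussed in this thread can be sketched as a small predicate. This is an illustrative sketch only: the names `shouldPoll`, `TraceState`, and `MAX_OK_POLL_ATTEMPTS` are hypothetical and do not come from `useGetTrace.tsx`.

```typescript
type TraceState = 'IN_PROGRESS' | 'OK' | 'ERROR';

// Illustrative retry budget; the PR discusses ~60 attempts at a 1s interval.
const MAX_OK_POLL_ATTEMPTS = 60;

function shouldPoll(
  state: TraceState | undefined,
  expectedSpans: number,
  actualSpans: number,
  okPollAttempts: number,
): boolean {
  // Terminal error: no more spans will ever arrive, so stop immediately.
  if (state === 'ERROR') return false;
  // All expected spans are present: done.
  if (expectedSpans === actualSpans) return false;
  // Trace completed, but child spans may still be uploading in parallel:
  // keep polling for a bounded number of attempts, then give up so that
  // inconsistent num_spans metadata cannot cause an infinite loop.
  if (state === 'OK') return okPollAttempts < MAX_OK_POLL_ATTEMPTS;
  // Trace still in progress: keep polling.
  return true;
}
```

The key design point is that the `OK` branch is the only one that consults the attempt counter, so a metadata mismatch can delay termination but never prevent it.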
…mismatches

The polling logic in useGetTrace continued indefinitely when the num_spans metadata was inconsistent with the actual span count, even after the trace reached a terminal state.

Add a bounded retry mechanism: after the trace reaches OK state, continue polling for up to 30 attempts (~30s) waiting for remaining child spans. If the count still doesn't match, stop polling to prevent infinite loops from metadata inconsistencies. This mirrors the backend's own retry approach in sqlalchemy_store.get_trace(), which also uses bounded retries.

Fixes mlflow#20595

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Force-pushed 2af3cdd to ff8324a
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
// Polling should eventually stop due to max retry count (not run forever)
// Wait a few seconds to verify polling occurs but is bounded
await new Promise((resolve) => setTimeout(resolve, 3000));
Could we use Jest fake timers instead of real timers for testing purposes?
Done! Converted all tests to use Jest fake timers. All tests pass locally.
serena-ruan left a comment:

LGTM! Thanks for the contribution :D Let's update tests and merge then!
Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
@serena-ruan Is it done? Can you check it?
…mismatches (mlflow#20596)

Signed-off-by: Chan Young Lee <cyl0504@gmail.com>
Co-authored-by: Serena Ruan <82044803+serena-ruan@users.noreply.github.com>
Related Issues/PRs
Resolves #20595
What changes are proposed in this pull request?
Fix an infinite fetch loop that occurs when viewing trace details if the `mlflow.trace.sizeStats.num_spans` metadata doesn't match the actual span count.

Problem: The polling logic in `useGetTrace.tsx` continued polling indefinitely after a trace reached the `OK` state, as long as the expected span count (from metadata) didn't match the actual span count, so the trace detail view kept refetching forever.

Root cause: The `num_spans` metadata can become inconsistent with the actual data (e.g., metadata says 33 spans, but only 32 exist in the DB). Since the frontend calls the API with `allow_partial=true`, the backend's span completeness check is bypassed entirely.

Solution: Add a bounded retry timeout for polling when the trace is in the `OK` state but the span count still doesn't match, and stop polling immediately in the `ERROR` state.

How is this PR tested?
Added unit tests for the new bounded polling behavior (using Jest fake timers).
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
Fix infinite fetch loop in trace detail view when `num_spans` metadata mismatches actual span count.

What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.