Fix distributed tracing rendering and improve doc #21070
daniellok-db merged 2 commits into mlflow:master
Conversation
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Documentation preview for c07df7d is available at: Changed Pages (1)
Pull request overview
Fixes trace polling behavior in the tracing UI for distributed traces and refreshes the distributed tracing documentation with clearer guidance and diagrams.
Changes:
- Update trace polling stop condition to handle distributed traces where the actual span count can exceed the client-computed `sizeStats.num_spans`.
- Expand distributed tracing docs with a “How it works” explanation, prerequisites, and a runnable two-service (FastAPI) example.
- Add/refresh distributed tracing diagrams in the docs static assets.
Reviewed changes
Copilot reviewed 2 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mlflow/server/js/src/shared/web-shared/genai-traces-table/hooks/useGetTrace.tsx | Adjusts polling logic to stop when actual >= expected to prevent unnecessary/incorrect polling for distributed traces. |
| docs/docs/genai/tracing/app-instrumentation/distributed-tracing.mdx | Reworks and expands distributed tracing documentation, adds images, prerequisites, and examples. |
| docs/static/images/llms/tracing/distributed-tracing.png | Updates/refreshes distributed tracing illustration asset. |
| docs/static/images/llms/tracing/distributed-tracing-flow.png | Adds a distributed tracing flow diagram asset used by the updated docs. |
From `useGetTrace.tsx`:

```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
  okStatePollCountRef.current = 0;
  return false;
}
```
🟡 MODERATE: This change introduces new behavior (stop polling when actual > expected) but there’s no unit test covering the distributed-tracing case where mlflow.trace.sizeStats.num_spans undercounts spans. Add a test asserting polling stops when spans.length exceeds num_spans (trace state OK).
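To make the requested test concrete, here is a language-agnostic Python sketch of the stop condition (the function name is hypothetical; the real logic lives in `useGetTrace.tsx` and would be exercised with the project's Jest setup):

```python
def should_keep_polling(actual_span_count: int, expected_span_count: int) -> bool:
    """Mirror of the polling stop condition: keep polling only while the
    actual span count is below the expected count from sizeStats."""
    # Stop when actual >= expected. In the distributed case the actual count
    # exceeds sizeStats.num_spans, since sizeStats only counts spans from
    # the originating process.
    return actual_span_count < expected_span_count

# Normal trace: counts match once all spans arrive -> stop polling.
assert should_keep_polling(3, 3) is False
# Distributed trace: 5 actual spans vs sizeStats.num_spans == 3 -> stop.
assert should_keep_polling(5, 3) is False
# Spans still in flight -> keep polling.
assert should_keep_polling(2, 3) is True
```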
```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
```
🟡 MODERATE: MLFLOW_EXPERIMENT_ID is documented elsewhere as a numeric experiment ID (e.g., ...=<experiment-id>). Here it’s set to an experiment name ("distributed-tracing-demo"), which will misconfigure users’ environments. Use MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo" or show a numeric ID for MLFLOW_EXPERIMENT_ID.
Suggested change:

```diff
- export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
+ export MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo"
```
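For readers configuring this from Python rather than the shell, the same settings map onto MLflow's standard environment variables (values here are placeholders; note the name-vs-numeric-ID distinction raised in the comment above):

```python
import os

# MLFLOW_EXPERIMENT_NAME takes a human-readable experiment name;
# MLFLOW_EXPERIMENT_ID expects the numeric ID assigned by the tracking server.
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "distributed-tracing-demo"
```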
Create a `server.py` file with the following content:
The diff excerpt shows only the code fence opener: `` ```python server.py ``
🟢 NIT: The fenced code block info string `python server.py` doesn’t match the doc-site convention used elsewhere (e.g., `python title="server.py"`). Consider switching to `title="server.py"` so the filename renders consistently in Docusaurus.
Suggested change:

````diff
- ```python server.py
+ ```python title="server.py"
````
When an agent calls an [MLflow AI Gateway](/genai/governance/ai-gateway/) endpoint with usage tracking enabled, the gateway automatically creates a trace for each request. If the agent also sends a `traceparent` header, the gateway will create a lightweight span under the agent's trace that links to the full gateway trace and includes token usage.

This means you get two complementary views:

An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes is a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.
🟢 NIT: Grammar: “includes is a lightweight span” is ungrammatical and reads like a typo. It should be “includes a lightweight span”.
Suggested change:

```diff
- ...while the agent trace includes is a lightweight span `gateway/<endpoint_name>`...
+ ...while the agent trace includes a lightweight span `gateway/<endpoint_name>`...
```
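The `traceparent` header mentioned above follows the W3C Trace Context format (`version-traceid-parentid-flags`). As a minimal, stdlib-only sketch of what a receiving service sees (in practice the MLflow/OpenTelemetry propagators handle this parsing for you):

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C Trace Context `traceparent` header into its fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError(f"malformed traceparent: {header!r}")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

# Example header from the W3C Trace Context specification.
fields = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
```

The `trace_id` is what lets the gateway attach its span under the agent's trace even though the two traces live in different experiments.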
Create a `client.py` file with the following content:
```python client.py
import requests
import mlflow
```
🟢 NIT: The fenced code block info string `python client.py` doesn’t match the doc-site convention used elsewhere (e.g., `python title="client.py"`). Consider switching to `title="client.py"` so the filename renders consistently in Docusaurus.
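The `client.py` body is truncated in this diff view. As a hedged, stdlib-only sketch of the client-side idea: generate IDs for the current span and attach a W3C `traceparent` header to the outgoing request so the server joins the same trace (in the real example, MLflow/OpenTelemetry propagators generate and inject this header from the active span):

```python
import os

def make_traceparent(trace_id_bytes: bytes, span_id_bytes: bytes) -> str:
    """Build a W3C Trace Context traceparent header (version 00, sampled)."""
    return f"00-{trace_id_bytes.hex()}-{span_id_bytes.hex()}-01"

# Hypothetical: these IDs would normally come from the active MLflow/OTel span,
# not from os.urandom.
headers = {"traceparent": make_traceparent(os.urandom(16), os.urandom(8))}
# e.g. requests.post("http://localhost:8000/predict", json=..., headers=headers)
```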
```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
```
q: is this the only side effect of `sizeStats != count(actual spans)`? Should we have backend logic to update the sizeStats field?
@TomeHirata I believe so. Having correct sizeStats is ideal, but how do we achieve that?
Is it hard to recompute `add_size_stats_to_trace_metadata` in `SqlAlchemyStore.log_spans`, like we do for cost metadata?
How do we know all the spans are exported to backend?
Do we need to know if all spans are exported? We can always update the stats when new spans are exported for the trace, which is how the aggregated token usage metadata is computed:
mlflow/mlflow/store/tracking/sqlalchemy_store.py, lines 4461 to 4487 in 0b0686a
Can we keep the current logic for sizeStats and then update the metadata only when current sizeStats > sizeStats from the total spans? This way, we won't break the polling logic for normal traces while fixing the exact issue.
Hmm we can do it for size, but it is quite challenging to do it for other stats like size percentile.
I see, because we need to know all span stats to compute percentiles? We could compute that, but it may introduce additional latency...
Yeah... even updating only the count adds overhead in the backend; it adds one more read-and-write operation to trace metadata for every span logging call. We do similar updates for token/cost metadata, but still not for every span; only LLM spans. The benefit doesn't seem to be worth that cost: we would get an accurate span count in sizeStats, but it is equivalent to len(spans).
Luckily we don't have much logic that relies on this metadata; only this frontend logic. Then maybe we can add a warning comment in the constant file to prevent any critical use case from depending on this in the future?
> Then maybe we can add a warning comment in the constant file to prevent any critical use case to depend on this in the future?
Yeah, let's go with this for now. If sizeStats is just len(spans), we may want to move away from this metadata as unreliable information can cause an issue in the long term.
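The incremental-update idea discussed above can be sketched as follows (field names are hypothetical, not MLflow's actual sizeStats schema). Counts and sums are cheap to fold in per batch of exported spans, while percentiles need the full span population, which is the overhead concern raised in the thread:

```python
def update_size_stats(stats: dict, new_span_sizes: list) -> dict:
    """Incrementally fold a batch of newly exported spans into running stats."""
    stats["num_spans"] += len(new_span_sizes)
    stats["total_size_bytes"] += sum(new_span_sizes)
    # Percentiles (e.g. p50/p95 span size) cannot be updated from a batch
    # alone; they require re-reading all spans for the trace.
    return stats

stats = {"num_spans": 3, "total_size_bytes": 1200}  # from originating service A
stats = update_size_stats(stats, [300, 450])        # spans exported by service B
```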
Also, if you're already changing the backend logic, you could update the trace end time to when it was last changed. In my case this happens:
Great catch! Let me file a follow-up PR to fix the end time issue.
Merging for now, we can fix follow-ups in future PRs.

What changes are proposed in this pull request?
Fixing an infinite reloading issue for distributed tracing.
Currently, the UI checks whether the span count in the trace info (`mlflow.trace.sizeStats`) matches the actual number of spans in the trace. However, this never matches in the distributed tracing case, because `mlflow.trace.sizeStats` is computed based on the number of spans created in the main service (where the root span is created).

Say we have two services A and B, where A calls B (just like the concept image added in this PR). When A generates 3 spans and B generates 2 spans, the total number of spans in the trace will be 5. However, `mlflow.trace.sizeStats` only counts the spans in service A, because it is computed on the client side and service A does not know how many spans service B creates. The check then never matches (3 != 5) and the UI keeps reloading indefinitely.

To address the issue, this PR makes a simple change: replace the exact-match check with `actual >= expected`. In the distributed tracing case, the actual span count exceeds the expected count in the span stats.
Also, updating documentation to provide better guidance.
How is this PR tested?
Tested that distributed tracing does not trigger indefinite re-rendering.
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- `area/tracking`: Tracking Service, tracking client APIs, autologging
- `area/models`: MLmodel format, model serialization/deserialization, flavors
- `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- `area/evaluation`: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- `area/gateway`: MLflow AI Gateway client APIs, server, and third-party integrations
- `area/prompts`: MLflow prompt engineering features, prompt templates, and prompt management
- `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- `area/projects`: MLproject format, project running backends
- `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- `area/build`: Build and test infrastructure for MLflow
- `area/docs`: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- `rn/breaking-change` - The PR will be mentioned in the "Breaking Changes" section
- `rn/feature` - A new user-facing feature worth mentioning in the release notes
- `rn/bug-fix` - A user-facing bug fix worth mentioning in the release notes
- `rn/documentation` - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
`Yes` should be selected for bug fixes, documentation updates, and other small changes. `No` should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.