
Fix distributed tracing rendering and improve doc #21070

Merged
daniellok-db merged 2 commits into mlflow:master from B-Step62:fix-distributed-tracing
Mar 4, 2026

Conversation

Collaborator

@B-Step62 B-Step62 commented Feb 23, 2026

What changes are proposed in this pull request?

Fixes an infinite reloading issue for distributed tracing.

Currently, the UI checks whether the span count in the trace info (`mlflow.trace.sizeStats`) matches the actual number of spans in the trace. However, this check never passes in the distributed tracing case, because `mlflow.trace.sizeStats` is computed from the number of spans created in the main service (where the root span is created).

Say we have two services, A and B, where A calls B (as in the concept image added in this PR). If A generates 3 spans and B generates 2 spans, the trace contains 5 spans in total. However, `mlflow.trace.sizeStats` only counts the spans in service A, because it is computed on the client side and service A does not know how many spans service B creates. The check therefore never passes (3 != 5) and the UI keeps reloading indefinitely.

```js
const expected = JSON.parse(traceStats).num_spans;
const actual = data?.data?.spans?.length ?? 0;

if (expected === actual) {
  return false; // Stop re-loading
}
...
return 1000; // Set re-load after 1 second
};
```

To address the issue, this PR makes a simple change: the exact-match check is replaced with a greater-than-or-equal comparison (`actual >= expected`), since in the distributed tracing case the actual span count can exceed the expected count from the size stats.
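Restating the fixed check as a small Python sketch for illustration (the actual logic lives in the TypeScript hook; the function name and return convention here are hypothetical):

```python
def next_poll_interval(expected: int, actual: int):
    """Return False to stop polling, or a retry delay in milliseconds.

    `expected` comes from mlflow.trace.sizeStats.num_spans, which only
    counts spans created in the originating service. In a distributed
    trace, `actual` (the span count the backend returns) can exceed
    `expected`, so the stop condition must be >= rather than ==.
    """
    if actual >= expected:
        return False  # trace is complete enough; stop re-loading
    return 1000  # spans still missing; retry after 1 second
```

With the service A/B example above, `next_poll_interval(3, 5)` now stops polling even though the two counts never match exactly.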

This PR also updates the documentation to provide better guidance.
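For background, distributed tracing stitches spans from multiple services into one trace by propagating the W3C `traceparent` header on outgoing requests. A minimal sketch of building and parsing that header (illustration only; MLflow and OpenTelemetry handle this propagation automatically):

```python
def make_traceparent(trace_id: str, parent_span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    return f"00-{trace_id}-{parent_span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header back into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,  # 32 hex chars, shared by every service in the trace
        "parent_span_id": parent_span_id,  # 16 hex chars, the caller's span
        "sampled": flags == "01",
    }
```

Service A injects this header into its HTTP call to service B; B parses it and creates its spans under the same `trace_id`, which is why the stored trace ends up with more spans than A's client-side stats report.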

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Tested that distributed tracing does not trigger indefinite re-rendering.

[Screenshot: 2026-02-23 at 12:53:01]

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.
  • Yes. Please link the corresponding PR or explain how you plan to update it.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added the size/M, area/docs (Documentation issues), area/tracing (MLflow Tracing and its integrations), and rn/documentation (Mention under Documentation Changes in Changelogs) labels Feb 23, 2026
@github-actions
Contributor

🛠 DevTools 🛠

Install mlflow from this PR

```bash
# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/21070/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/21070/merge#subdirectory=libs/skinny
```

For Databricks, use the following command:

```bash
%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/21070/merge
```

@B-Step62 B-Step62 force-pushed the fix-distributed-tracing branch from 18ca2a1 to f9a84e0 on February 23, 2026 05:25
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 B-Step62 force-pushed the fix-distributed-tracing branch from f9a84e0 to fd47b32 on February 23, 2026 05:26
@github-actions
Contributor

github-actions bot commented Feb 23, 2026

Documentation preview for c07df7d is available at:

Changed Pages (1)

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@B-Step62 B-Step62 added the team-review (Trigger a team review request) label Feb 24, 2026
@harupy harupy requested a review from Copilot February 24, 2026 08:06
Contributor

Copilot AI left a comment


Pull request overview

Fixes trace polling behavior in the tracing UI for distributed traces and refreshes the distributed tracing documentation with clearer guidance and diagrams.

Changes:

  • Update trace polling stop condition to handle distributed traces where the actual span count can exceed the client-computed sizeStats.num_spans.
  • Expand distributed tracing docs with a “How it works” explanation, prerequisites, and a runnable two-service (FastAPI) example.
  • Add/refresh distributed tracing diagrams in the docs static assets.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 5 comments.

Reviewed files:

  • mlflow/server/js/src/shared/web-shared/genai-traces-table/hooks/useGetTrace.tsx: Adjusts polling logic to stop when actual >= expected, preventing unnecessary/incorrect polling for distributed traces.
  • docs/docs/genai/tracing/app-instrumentation/distributed-tracing.mdx: Reworks and expands distributed tracing documentation; adds images, prerequisites, and examples.
  • docs/static/images/llms/tracing/distributed-tracing.png: Updates/refreshes the distributed tracing illustration asset.
  • docs/static/images/llms/tracing/distributed-tracing-flow.png: Adds a distributed tracing flow diagram asset used by the updated docs.

Comment on lines +84 to 90
```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
  okStatePollCountRef.current = 0;
  return false;
}
```

Copilot AI Feb 24, 2026


🟡 MODERATE: This change introduces new behavior (stop polling when actual > expected) but there’s no unit test covering the distributed-tracing case where mlflow.trace.sizeStats.num_spans undercounts spans. Add a test asserting polling stops when spans.length exceeds num_spans (trace state OK).


```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
```

Copilot AI Feb 24, 2026


🟡 MODERATE: MLFLOW_EXPERIMENT_ID is documented elsewhere as a numeric experiment ID (e.g., ...=<experiment-id>). Here it’s set to an experiment name ("distributed-tracing-demo"), which will misconfigure users’ environments. Use MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo" or show a numeric ID for MLFLOW_EXPERIMENT_ID.

Suggested change:
- export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
+ export MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo"


Create a `server.py` file with the following content:

```python server.py
```

Copilot AI Feb 24, 2026


🟢 NIT: The fenced code block info string `python server.py` doesn't match the doc-site convention used elsewhere (e.g., `python title="server.py"`). Consider switching to `title="server.py"` so the filename renders consistently in Docusaurus.

Suggested change:
- ```python server.py
+ ```python title="server.py"

When an agent calls an [MLflow AI Gateway](/genai/governance/ai-gateway/) endpoint with usage tracking enabled, the gateway automatically creates a trace for each request. If the agent also sends a `traceparent` header, the gateway will create a lightweight span under the agent's trace that links to the full gateway trace and includes token usage.

This means you get two complementary views:
An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes is a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.

Copilot AI Feb 24, 2026


🟢 NIT: Grammar: “includes is a lightweight span” is ungrammatical and reads like a typo. It should be “includes a lightweight span”.

Suggested change:
- An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes is a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.
+ An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.

Comment on lines +82 to +86
Create a `client.py` file with the following content:

```python client.py
import requests
import mlflow
```

Copilot AI Feb 24, 2026


🟢 NIT: The fenced code block info string `python client.py` doesn't match the doc-site convention used elsewhere (e.g., `python title="client.py"`). Consider switching to `title="client.py"` so the filename renders consistently in Docusaurus.

Member

@harupy harupy left a comment


LGTM once copilot comments are addressed! I believe some are legit.

```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
```
Collaborator


q: is this the only side effect of sizeStats != count(actual spans)? Should we have backend logic to update the sizeStats field?

Collaborator Author


@TomeHirata I believe so. Having correct sizeStats is ideal, but how do we achieve that?

Collaborator


Is it hard to recompute `add_size_stats_to_trace_metadata` in `SqlAlchemyStore.log_spans` like we do for cost metadata?

Collaborator Author


How do we know all the spans are exported to backend?

Collaborator


Do we need to know if all spans are exported? We can always update the stats when new spans are exported for the trace, which is how the aggregated token usage metadata is computed:

```python
if aggregated_token_usage:
    trace_token_usage_record = (
        session.query(SqlTraceMetadata)
        .filter(
            SqlTraceMetadata.request_id == trace_id,
            SqlTraceMetadata.key == TraceMetadataKey.TOKEN_USAGE,
        )
        .one_or_none()
    )
    trace_token_usage = update_token_usage(
        trace_token_usage_record.value if trace_token_usage_record else {},
        aggregated_token_usage,
    )
    session.merge(
        SqlTraceMetadata(
            request_id=trace_id,
            key=TraceMetadataKey.TOKEN_USAGE,
            value=json.dumps(trace_token_usage),
        )
    )
    # Store token usage as trace metrics
    for key in TokenUsageKey.all_keys():
        if (value := trace_token_usage.get(key)) is not None:
            session.merge(
                SqlTraceMetrics(request_id=trace_id, key=key, value=float(value))
            )
```

Collaborator


Can we keep the current logic for sizeStats and then update the metadata only when current sizeStats > sizeStats from the total spans? This way, we won't break the polling logic for normal traces while fixing the exact issue.

Collaborator Author


Hmm we can do it for size, but it is quite challenging to do it for other stats like size percentile.

Collaborator

@TomeHirata TomeHirata Feb 27, 2026


I see, because we need to know all span stats to compute percentiles? We could compute that, but it may introduce additional latency.

Collaborator Author

@B-Step62 B-Step62 Feb 27, 2026


Yeah... even updating only the count adds overhead in the backend; it adds one more read-and-write operation on trace metadata for every span logging call. We do similar updates for token/cost metadata, but that is still not for every span; only LLM spans. The benefit doesn't seem worth the cost: we would get an accurate span count in sizeStats, but it would just be equivalent to len(spans).

Luckily we don't have much logic that relies on this metadata; only this frontend logic. Maybe we can then add a warning comment in the constant file to prevent any critical use case from depending on this in the future?

Collaborator


Then maybe we can add a warning comment in the constant file to prevent any critical use case to depend on this in the future?

Yeah, let's go with this for now. If sizeStats is just len(spans), we may want to move away from this metadata as unreliable information can cause an issue in the long term.

@romtecmax

Also - if you're already changing the backend logic, you could update the trace end time to when it was last changed.

In my case this happens:

  1. backend starts a trace
  2. backend ends the trace when handing off task to client
  3. client returns completed task with trace id
  4. backend adds new spans to the original trace using the passed trace id BUT the original trace end time stays the same

See image below

@B-Step62
Collaborator Author

Great catch! Let me file a follow-up PR to fix the end time issue

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Collaborator

@TomeHirata TomeHirata left a comment


LGTM

@daniellok-db
Collaborator

merging for now, we can fix follow-ups in future PRs

@daniellok-db daniellok-db added this pull request to the merge queue Mar 4, 2026
Merged via the queue into mlflow:master with commit 5b6010e Mar 4, 2026
23 checks passed
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Mar 5, 2026
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
daniellok-db pushed a commit that referenced this pull request Mar 5, 2026
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>

Labels

area/docs (Documentation issues), area/tracing (MLflow Tracing and its integrations), rn/documentation (Mention under Documentation Changes in Changelogs), size/M, team-review (Trigger a team review request), v3.10.1


6 participants