
Fix distributed tracing rendering and improve doc #21070

Merged
daniellok-db merged 2 commits into mlflow:master from B-Step62:fix-distributed-tracing
Mar 4, 2026

Conversation

Collaborator

@B-Step62 B-Step62 commented Feb 23, 2026

What changes are proposed in this pull request?

Fixes an infinite reloading issue for distributed tracing.

Currently, the UI checks whether the span count in the trace info (`mlflow.trace.sizeStats`) matches the actual number of spans in the trace. However, this check never passes in the distributed tracing case, because `mlflow.trace.sizeStats` is computed from the number of spans created in the main service (where the root span is created).

Say we have two services, A and B, where A calls B (as in the concept image added in this PR). If A generates 3 spans and B generates 2 spans, the trace contains 5 spans in total. However, `mlflow.trace.sizeStats` only counts the spans in service A, because it is computed on the client side and service A does not know how many spans service B creates. The check therefore never passes (3 != 5) and the UI keeps reloading indefinitely.

```js
const expected = JSON.parse(traceStats).num_spans;
const actual = data?.data?.spans?.length ?? 0;

if (expected === actual) {
  return false; // Stop re-loading
}
...
return 1000; // Set re-load after 1 second
};
```

To address the issue, this PR makes a simple change: the exact-match check is replaced with a greater-than-or-equal comparison (`actual >= expected`), since in the distributed tracing case the actual span count can exceed the expected count from the size stats.
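Restating the fixed check as a small Python sketch for illustration (the actual logic lives in the TypeScript hook; the function name and return convention here are hypothetical):

```python
def next_poll_interval(expected: int, actual: int):
    """Return False to stop polling, or a retry delay in milliseconds.

    `expected` comes from mlflow.trace.sizeStats.num_spans, which only
    counts spans created in the originating service. In a distributed
    trace, `actual` (the span count the backend returns) can exceed
    `expected`, so the stop condition must be >= rather than ==.
    """
    if actual >= expected:
        return False  # trace is complete enough; stop re-loading
    return 1000  # spans still missing; retry after 1 second
```

With the service A/B example above, `next_poll_interval(3, 5)` now stops polling even though the two counts never match exactly.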

This PR also updates the documentation to provide better guidance.
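For background, distributed tracing stitches spans from multiple services into one trace by propagating the W3C `traceparent` header on outgoing requests. A minimal sketch of building and parsing that header (illustration only; MLflow and OpenTelemetry handle this propagation automatically):

```python
def make_traceparent(trace_id: str, parent_span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    return f"00-{trace_id}-{parent_span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header back into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,  # 32 hex chars, shared by every service in the trace
        "parent_span_id": parent_span_id,  # 16 hex chars, the caller's span
        "sampled": flags == "01",
    }
```

Service A injects this header into its HTTP call to service B; B parses it and creates its spans under the same `trace_id`, which is why the stored trace ends up with more spans than A's client-side stats report.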

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Tested that distributed tracing does not trigger indefinite re-rendering.

[Screenshot: 2026-02-23 at 12:53:01]

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.
  • Yes. Please link the corresponding PR or explain how you plan to update it.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added the size/M, area/docs (Documentation issues), area/tracing (MLflow Tracing and its integrations), and rn/documentation (Mention under Documentation Changes in Changelogs) labels Feb 23, 2026
@github-actions
Contributor

🛠 DevTools 🛠

Install mlflow from this PR

```bash
# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/21070/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/21070/merge#subdirectory=libs/skinny
```

For Databricks, use the following command:

```bash
%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/21070/merge
```

@B-Step62 B-Step62 force-pushed the fix-distributed-tracing branch from 18ca2a1 to f9a84e0 on February 23, 2026 05:25
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 B-Step62 force-pushed the fix-distributed-tracing branch from f9a84e0 to fd47b32 on February 23, 2026 05:26
@github-actions
Contributor

github-actions bot commented Feb 23, 2026

Documentation preview for c07df7d is available at:

Changed Pages (1)

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@B-Step62 B-Step62 added the team-review (Trigger a team review request) label Feb 24, 2026
@harupy harupy requested a review from Copilot February 24, 2026 08:06
Contributor

Copilot AI left a comment


Pull request overview

Fixes trace polling behavior in the tracing UI for distributed traces and refreshes the distributed tracing documentation with clearer guidance and diagrams.

Changes:

  • Update trace polling stop condition to handle distributed traces where the actual span count can exceed the client-computed sizeStats.num_spans.
  • Expand distributed tracing docs with a “How it works” explanation, prerequisites, and a runnable two-service (FastAPI) example.
  • Add/refresh distributed tracing diagrams in the docs static assets.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 5 comments.

Reviewed files:

  • mlflow/server/js/src/shared/web-shared/genai-traces-table/hooks/useGetTrace.tsx: Adjusts polling logic to stop when actual >= expected, preventing unnecessary/incorrect polling for distributed traces.
  • docs/docs/genai/tracing/app-instrumentation/distributed-tracing.mdx: Reworks and expands distributed tracing documentation; adds images, prerequisites, and examples.
  • docs/static/images/llms/tracing/distributed-tracing.png: Updates/refreshes the distributed tracing illustration asset.
  • docs/static/images/llms/tracing/distributed-tracing-flow.png: Adds a distributed tracing flow diagram asset used by the updated docs.

Comment on lines +84 to 90
```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
  okStatePollCountRef.current = 0;
  return false;
}
```

Copilot AI Feb 24, 2026


🟡 MODERATE: This change introduces new behavior (stop polling when actual > expected) but there’s no unit test covering the distributed-tracing case where mlflow.trace.sizeStats.num_spans undercounts spans. Add a test asserting polling stops when spans.length exceeds num_spans (trace state OK).


```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
```

Copilot AI Feb 24, 2026


🟡 MODERATE: MLFLOW_EXPERIMENT_ID is documented elsewhere as a numeric experiment ID (e.g., ...=<experiment-id>). Here it’s set to an experiment name ("distributed-tracing-demo"), which will misconfigure users’ environments. Use MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo" or show a numeric ID for MLFLOW_EXPERIMENT_ID.

Suggested change:
- export MLFLOW_EXPERIMENT_ID="distributed-tracing-demo"
+ export MLFLOW_EXPERIMENT_NAME="distributed-tracing-demo"


Create a `server.py` file with the following content:

```python server.py
```

Copilot AI Feb 24, 2026


🟢 NIT: The fenced code block info string `python server.py` doesn't match the doc-site convention used elsewhere (e.g., `python title="server.py"`). Consider switching to `title="server.py"` so the filename renders consistently in Docusaurus.

Suggested change:
- ```python server.py
+ ```python title="server.py"

When an agent calls an [MLflow AI Gateway](/genai/governance/ai-gateway/) endpoint with usage tracking enabled, the gateway automatically creates a trace for each request. If the agent also sends a `traceparent` header, the gateway will create a lightweight span under the agent's trace that links to the full gateway trace and includes token usage.

This means you get two complementary views:
An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes is a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.

Copilot AI Feb 24, 2026


🟢 NIT: Grammar: “includes is a lightweight span” is ungrammatical and reads like a typo. It should be “includes a lightweight span”.

Suggested change:
- An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes is a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.
+ An important note for AI Gateway integration is that the gateway trace and the agent trace are stored in **different** experiments, while they are linked together through a span attribute. The gateway trace is a full trace with request/response payloads and token usage, while the agent trace includes a lightweight span `gateway/<endpoint_name>` that contains a link to the gateway trace and token usage in span attributes. This design allows aggregating gateway usage across experiments while avoiding storing duplicated payloads.

Comment on lines +82 to +86
Create a `client.py` file with the following content:

```python client.py
import requests
import mlflow
```

Copilot AI Feb 24, 2026


🟢 NIT: The fenced code block info string `python client.py` doesn't match the doc-site convention used elsewhere (e.g., `python title="client.py"`). Consider switching to `title="client.py"` so the filename renders consistently in Docusaurus.

Member

@harupy harupy left a comment


LGTM once copilot comments are addressed! I believe some are legit.

```js
// Poll until the actual span count reaches at least the expected count.
// NB: In distributed tracing case, the actual span count exceeds the expected count because
// sizeStats only counts spans from the originating process.
if (actual >= expected) {
```
Collaborator


q: is this the only side effect of sizeStats != count(actual spans)? Should we have backend logic to update the sizeStats field?

Collaborator Author


@TomeHirata I believe so. Having correct sizeStats is ideal, but how do we achieve that?

Collaborator


Is it hard to recompute `add_size_stats_to_trace_metadata` in `SqlAlchemyStore.log_spans` like we do for cost metadata?

Collaborator Author


How do we know all the spans are exported to backend?

Collaborator


Do we need to know if all spans are exported? We can always update the stats when new spans are exported for the trace, which is how the aggregated token usage metadata is computed:

```python
if aggregated_token_usage:
    trace_token_usage_record = (
        session.query(SqlTraceMetadata)
        .filter(
            SqlTraceMetadata.request_id == trace_id,
            SqlTraceMetadata.key == TraceMetadataKey.TOKEN_USAGE,
        )
        .one_or_none()
    )
    trace_token_usage = update_token_usage(
        trace_token_usage_record.value if trace_token_usage_record else {},
        aggregated_token_usage,
    )
    session.merge(
        SqlTraceMetadata(
            request_id=trace_id,
            key=TraceMetadataKey.TOKEN_USAGE,
            value=json.dumps(trace_token_usage),
        )
    )
    # Store token usage as trace metrics
    for key in TokenUsageKey.all_keys():
        if (value := trace_token_usage.get(key)) is not None:
            session.merge(
                SqlTraceMetrics(request_id=trace_id, key=key, value=float(value))
            )
```

Collaborator


Can we keep the current logic for sizeStats and then update the metadata only when current sizeStats > sizeStats from the total spans? This way, we won't break the polling logic for normal traces while fixing the exact issue.

Collaborator Author


Hmm we can do it for size, but it is quite challenging to do it for other stats like size percentile.

Collaborator

@TomeHirata TomeHirata Feb 27, 2026


I see, because we need to know all span stats to compute percentiles? We could compute that, but it may introduce additional latency.

Collaborator Author

@B-Step62 B-Step62 Feb 27, 2026


Yeah... even updating only the count adds overhead in the backend; it adds one more read-and-write operation on trace metadata for every span logging call. We do similar updates for token/cost metadata, but that is still not for every span; only LLM spans. The benefit doesn't seem worth the cost: we would get an accurate span count in sizeStats, but it would just be equivalent to len(spans).

Luckily we don't have much logic that relies on this metadata; only this frontend logic. Maybe we can then add a warning comment in the constant file to prevent any critical use case from depending on this in the future?

Collaborator


Then maybe we can add a warning comment in the constant file to prevent any critical use case to depend on this in the future?

Yeah, let's go with this for now. If sizeStats is just len(spans), we may want to move away from this metadata as unreliable information can cause an issue in the long term.

@romtecmax

Also - if you're already changing the backend logic, you could update the trace end time to when it was last changed.

In my case this happens:

  1. backend starts a trace
  2. backend ends the trace when handing off task to client
  3. client returns completed task with trace id
  4. backend adds new spans to the original trace using the passed trace id BUT the original trace end time stays the same

See image below

@B-Step62
Collaborator Author

Great catch! Let me file a follow-up PR to fix the end time issue

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Collaborator

@TomeHirata TomeHirata left a comment


LGTM

@daniellok-db
Collaborator

merging for now, we can fix follow-ups in future PRs

@daniellok-db daniellok-db added this pull request to the merge queue Mar 4, 2026
Merged via the queue into mlflow:master with commit 5b6010e Mar 4, 2026
23 checks passed
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Mar 5, 2026
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
daniellok-db pushed a commit that referenced this pull request Mar 5, 2026
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>

Labels

area/docs (Documentation issues), area/tracing (MLflow Tracing and its integrations), rn/documentation (Mention under Documentation Changes in Changelogs), size/M, team-review (Trigger a team review request), v3.10.1


6 participants