
Support batch span export to UC Table#19324

Merged
B-Step62 merged 3 commits into mlflow:master from B-Step62:batch-span-export
Dec 12, 2025

Conversation

Collaborator

@B-Step62 B-Step62 commented Dec 11, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19324/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19324/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19324/merge

What changes are proposed in this pull request?

For Zerobus traces, the TraceExport endpoint can handle a batch of spans. However, we are not making use of this, since the spans passed to the exporter are exported immediately (via the async queue); in other words, every span creates a new export request. This is highly inefficient and limits scalability.

This PR introduces queue-based batching in the UC exporter. It is built as a pluggable component on top of the current async queue, rather than modifying the existing queue directly, since that queue handles multiple types of requests, e.g. StartTrace requests and other exporters.

The default is no batching to reduce the blast radius, but it might actually be fine to change it, given that the product is in private preview (easier to change now than after public release). Will revisit this with the team before merging.
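The queue-based batching described above can be sketched as follows. This is a minimal illustration, not MLflow's actual implementation: the `SpanBatcher` class name and `export_fn` callback are assumptions, while `add_span(location=..., span=...)` mirrors the call shape discussed later in this review.

```python
import threading
from collections import defaultdict


class SpanBatcher:
    """Hypothetical count-based span batcher for a UC-table exporter.

    Spans are buffered per destination table ("location") and flushed as
    one export call once a buffer reaches `max_batch_size`.
    """

    def __init__(self, export_fn, max_batch_size=50):
        self._export_fn = export_fn        # callable(location, list_of_spans)
        self._max_batch_size = max_batch_size
        self._lock = threading.Lock()
        self._buffers = defaultdict(list)  # location -> pending spans

    def add_span(self, location, span):
        with self._lock:
            buf = self._buffers[location]
            buf.append(span)
            if len(buf) >= self._max_batch_size:
                batch, self._buffers[location] = buf, []
            else:
                return
        # Export outside the lock so slow I/O does not block producers.
        self._export_fn(location, batch)

    def flush(self):
        """Export all partially filled buffers, e.g. on shutdown."""
        with self._lock:
            pending = dict(self._buffers)
            self._buffers.clear()
        for location, batch in pending.items():
            if batch:
                self._export_fn(location, batch)
```

With `max_batch_size=1` (the default proposed in this PR), every `add_span` call exports immediately, preserving the pre-PR one-request-per-span behavior.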

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
(Screenshot of manual test run, 2025-12-11)

Does this PR require documentation update?

This definitely needs to be documented once the feature becomes public.

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Support span batching when exporting spans to Databricks UC tables.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@github-actions github-actions bot added area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs. labels Dec 11, 2025
Contributor

github-actions bot commented Dec 11, 2025

Documentation preview for e43fd72 is available at:


Queue-based batching processor for span export to a Databricks Unity Catalog table.

Exposes two configuration knobs:
- Max span batch size: The maximum number of spans to export in a single batch.
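A knob like this would typically be read from an environment variable with a conservative default. The sketch below is an assumption based on the env var name shown in the diff further down; MLflow's real `_EnvironmentVariable` helper is not reproduced here, and a default of 1 means "no batching" unless the user opts in.

```python
import os


def get_max_span_batch_size(default=1):
    """Hypothetical reader for the batch-size knob.

    The env var name follows the suggested change in this review; the
    default of 1 preserves one-request-per-span behavior.
    """
    raw = os.environ.get("MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE")
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("batch size must be a positive integer")
    return value
```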
Collaborator

Span sizes may be massively skewed depending on the use case (e.g. 30MB for one use case while a few bytes for another). Therefore, payload size would be preferable over just the number of records per batch.

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

That's fair, though OpenTelemetry's batch span processor only supports count-based batching, so I'm not sure there is strong demand here. There is also a downside: we would need to serialize the spans just to know the actual payload size, which introduces overhead. The batch condition is evaluated in a single thread, so it can become a bottleneck when system throughput is high.
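The cost asymmetry behind this trade-off can be made concrete. The sketch below is illustrative only (function names are invented, and JSON stands in for whatever wire format the exporter actually uses): a count trigger is O(1), while a size trigger must serialize every buffered span just to measure it, and that work lands on the single consumer thread.

```python
import json


def count_trigger(batch, max_batch_size):
    # O(1): just compare the buffer length -- this is all that
    # OpenTelemetry-style count-based batching needs.
    return len(batch) >= max_batch_size


def size_trigger(batch, max_payload_bytes):
    # O(n) in both span count and span size: every span must be
    # serialized purely to measure it, which is the overhead
    # discussed in the comment above.
    return sum(len(json.dumps(s).encode()) for s in batch) >= max_payload_bytes
```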

Collaborator

Chatted offline here. Let's start with count-based batching. If we end up seeing issues where individual batches are too large for some customers:

  1. they can reduce the batch size
  2. we can introduce a max payload size config in the future

Comment on lines +926 to +927
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_BATCH_SIZE", int, 1
Collaborator

Suggested change
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_BATCH_SIZE", int, 1
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE", int, 1

to be consistent

Collaborator Author

thx for the catch!

Comment on lines +51 to +52
for span in spans:
self._span_batcher.add_span(location=location, span=span)
Collaborator

Could we add an add_spans method instead (without the batch size env var)? Before this change all spans were exported in one request, but if we add_span one by one, won't there be one request per span?

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

Could you elaborate? Previously we exported the spans passed to export(spans) together; however, that is practically just a single span, because span.end() directly triggers export without batching.

Collaborator

Ah yeah, that's true...
Would supporting BatchSpanProcessor also work?

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

Yeah, I thought about it, but that's non-trivial. We have lots of custom logic in the current span processors, so I struggle to adopt BatchSpanProcessor there. Also, we would need to move the trace logging logic from the exporter into our processors, because trace logging should not be batched together with span logging.
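The split described in this comment can be sketched roughly as follows. Everything here is hypothetical (the class, the `on_end` hook shape, and the "root span ends => trace is complete" heuristic are assumptions for illustration, not MLflow's actual processor logic): span records go through a batcher, while the per-trace record is logged individually and never mixed into span batches.

```python
class UCSpanProcessor:
    """Hypothetical processor that batches spans but not trace records."""

    def __init__(self, span_batcher, trace_logger):
        self._span_batcher = span_batcher  # exposes add_span(location, span)
        self._trace_logger = trace_logger  # exposes log_trace(trace_id)

    def on_end(self, span):
        # Span payloads can be batched freely across traces...
        self._span_batcher.add_span(location=span.location, span=span)
        # ...but the trace-level record must be logged per trace, not
        # batched together with span logging. Here we assume a trace is
        # complete when its root span (no parent) ends.
        if span.parent is None:
            self._trace_logger.log_trace(span.trace_id)
```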

Collaborator

I see, thanks for the explanation!

Collaborator

@serena-ruan serena-ruan left a comment

LGTM!

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 B-Step62 added this pull request to the merge queue Dec 12, 2025
Merged via the queue into mlflow:master with commit e9ed779 Dec 12, 2025
54 checks passed
@B-Step62 B-Step62 deleted the batch-span-export branch December 12, 2025 04:19

Labels

area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs. v3.8.0
