
Support batch span export to UC Table#19324

Merged
B-Step62 merged 3 commits into mlflow:master from B-Step62:batch-span-export
Dec 12, 2025

Conversation

Collaborator

@B-Step62 B-Step62 commented Dec 11, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19324/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19324/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19324/merge

What changes are proposed in this pull request?

For Zerobus traces, the TraceExport endpoint can handle a batch of spans. However, we are not making use of this, since the spans passed to the exporter are exported immediately (via the async queue); in other words, every span creates a new export request. This is highly inefficient and limits scalability.

This PR introduces queue-based batching in the UC exporter. It is built as a pluggable component on top of the current async queue, rather than modifying the existing queue directly, since that queue handles multiple types of requests, e.g. StartTrace requests and other exporters.

The default is no batching to reduce the blast radius, but it might actually be fine to change it, given that the product is in private preview (easier to change now than after public release). Will revisit this with the team before merging.
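The queue-based batching described above can be sketched as follows. This is a minimal illustration, not MLflow's actual implementation: the `SpanBatcher` class name and `export_fn` callback are assumptions, while `add_span(location=..., span=...)` mirrors the call shape discussed later in this review.

```python
import threading
from collections import defaultdict


class SpanBatcher:
    """Hypothetical count-based span batcher for a UC-table exporter.

    Spans are buffered per destination table ("location") and flushed as
    one export call once a buffer reaches `max_batch_size`.
    """

    def __init__(self, export_fn, max_batch_size=50):
        self._export_fn = export_fn        # callable(location, list_of_spans)
        self._max_batch_size = max_batch_size
        self._lock = threading.Lock()
        self._buffers = defaultdict(list)  # location -> pending spans

    def add_span(self, location, span):
        with self._lock:
            buf = self._buffers[location]
            buf.append(span)
            if len(buf) >= self._max_batch_size:
                batch, self._buffers[location] = buf, []
            else:
                return
        # Export outside the lock so slow I/O does not block producers.
        self._export_fn(location, batch)

    def flush(self):
        """Export all partially filled buffers, e.g. on shutdown."""
        with self._lock:
            pending = dict(self._buffers)
            self._buffers.clear()
        for location, batch in pending.items():
            if batch:
                self._export_fn(location, batch)
```

With `max_batch_size=1` (the default proposed in this PR), every `add_span` call exports immediately, preserving the pre-PR one-request-per-span behavior.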

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
(Screenshot of manual test run, 2025-12-11)

Does this PR require documentation update?

This definitely needs to be documented once the feature becomes public.

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Support span batching when exporting spans to Databricks UC tables.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@github-actions github-actions bot added area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs. labels Dec 11, 2025
Contributor

github-actions bot commented Dec 11, 2025

Documentation preview for e43fd72 is available at:


Queue-based batching processor for span export to a Databricks Unity Catalog table.

Exposes two configuration knobs:
- Max span batch size: The maximum number of spans to export in a single batch.
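A knob like this would typically be read from an environment variable with a conservative default. The sketch below is an assumption based on the env var name shown in the diff further down; MLflow's real `_EnvironmentVariable` helper is not reproduced here, and a default of 1 means "no batching" unless the user opts in.

```python
import os


def get_max_span_batch_size(default=1):
    """Hypothetical reader for the batch-size knob.

    The env var name follows the suggested change in this review; the
    default of 1 preserves one-request-per-span behavior.
    """
    raw = os.environ.get("MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE")
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("batch size must be a positive integer")
    return value
```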
Collaborator

Span sizes may be massively skewed depending on the use case (e.g. 30MB for one use case while a few bytes for another). Therefore, payload size would be preferable over just the number of records per batch.

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

That's fair, though OpenTelemetry's batch span processor only supports count-based batching, so I'm not sure there is strong demand here. There is also a downside: we would need to serialize the spans just to know the actual payload size, which introduces overhead. The batch condition is evaluated in a single thread, so it can become a bottleneck when system throughput is high.
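The cost asymmetry behind this trade-off can be made concrete. The sketch below is illustrative only (function names are invented, and JSON stands in for whatever wire format the exporter actually uses): a count trigger is O(1), while a size trigger must serialize every buffered span just to measure it, and that work lands on the single consumer thread.

```python
import json


def count_trigger(batch, max_batch_size):
    # O(1): just compare the buffer length -- this is all that
    # OpenTelemetry-style count-based batching needs.
    return len(batch) >= max_batch_size


def size_trigger(batch, max_payload_bytes):
    # O(n) in both span count and span size: every span must be
    # serialized purely to measure it, which is the overhead
    # discussed in the comment above.
    return sum(len(json.dumps(s).encode()) for s in batch) >= max_payload_bytes
```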

Collaborator

Chatted offline here. Let's start with count-based batching. If we end up seeing issues where individual batches are too large for some customers:

  1. they can reduce the batch size
  2. we can introduce a max payload size config in the future

Comment on lines +926 to +927
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_BATCH_SIZE", int, 1
Collaborator

Suggested change
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_BATCH_SIZE", int, 1
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
"MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE", int, 1

to be consistent

Collaborator Author

thx for the catch!

Comment on lines +51 to +52
for span in spans:
self._span_batcher.add_span(location=location, span=span)
Collaborator

Could we add an add_spans method instead (without the batch size env var)? Before this change all spans were exported in one request, but if we add_span one by one, won't there be one request per span?

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

Could you elaborate? Previously we exported the spans passed to export(spans) together; however, that is practically just a single span, because span.end() directly triggers export without batching.

Collaborator

Ah yeah, that's true...
Would supporting BatchSpanProcessor also work?

Collaborator Author

@B-Step62 B-Step62 Dec 11, 2025

Yeah, I thought about it, but that's non-trivial. We have lots of custom logic in the current span processors, so I struggle to adopt BatchSpanProcessor there. Also, we would need to move the trace logging logic from the exporter into our processors, because trace logging should not be batched together with span logging.
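The split described in this comment can be sketched roughly as follows. Everything here is hypothetical (the class, the `on_end` hook shape, and the "root span ends => trace is complete" heuristic are assumptions for illustration, not MLflow's actual processor logic): span records go through a batcher, while the per-trace record is logged individually and never mixed into span batches.

```python
class UCSpanProcessor:
    """Hypothetical processor that batches spans but not trace records."""

    def __init__(self, span_batcher, trace_logger):
        self._span_batcher = span_batcher  # exposes add_span(location, span)
        self._trace_logger = trace_logger  # exposes log_trace(trace_id)

    def on_end(self, span):
        # Span payloads can be batched freely across traces...
        self._span_batcher.add_span(location=span.location, span=span)
        # ...but the trace-level record must be logged per trace, not
        # batched together with span logging. Here we assume a trace is
        # complete when its root span (no parent) ends.
        if span.parent is None:
            self._trace_logger.log_trace(span.trace_id)
```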

Collaborator

I see, thanks for the explanation!

Collaborator

@serena-ruan serena-ruan left a comment

LGTM!

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 B-Step62 added this pull request to the merge queue Dec 12, 2025
Merged via the queue into mlflow:master with commit e9ed779 Dec 12, 2025
54 checks passed
@B-Step62 B-Step62 deleted the batch-span-export branch December 12, 2025 04:19

Labels

area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs. v3.8.0
