Support batch span export to UC Table #19324
Conversation
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
> Queue based batching processor for span export to Databricks Unity Catalog table.
>
> Exposes two configuration knobs:
> - Max span batch size: The maximum number of spans to export in a single batch.
The span sizes may be massively skewed depending on the use case (e.g., 30MB for one use case but a few bytes for another). Therefore, batching by payload size would be preferable to batching by record count alone.
That's fair, though OpenTelemetry's batch span processor only supports count-based batching, so I'm not sure there is strong demand here. There is also a downside: we would need to serialize spans up front to know the actual payload size, which introduces overhead. The batch condition is evaluated in a single thread, so it can become a bottleneck when system throughput is high.
Chatted offline here. Let's start with count-based batching. If we end up seeing issues where individual batches are too large for some customers:
- they can reduce the batch size
- we can introduce a max payload size config in the future
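For illustration, a count-based batcher along these lines might look as follows. This is a sketch with hypothetical names (`SpanBatcher`, `add_span`, `flush`, `export_fn`), not MLflow's actual implementation: spans accumulate until the batch reaches the configured size, at which point they are exported in a single request.

```python
from dataclasses import dataclass, field


@dataclass
class SpanBatcher:
    """Count-based span batcher (sketch). Flushes when the pending batch
    reaches ``max_batch_size`` spans. ``export_fn`` is whatever sends one
    batch of spans in a single export request."""

    export_fn: callable
    max_batch_size: int = 100
    _batch: list = field(default_factory=list)

    def add_span(self, span):
        self._batch.append(span)
        if len(self._batch) >= self.max_batch_size:
            self.flush()

    def flush(self):
        # Send the pending spans as one request and start a fresh batch.
        if self._batch:
            self.export_fn(self._batch)
            self._batch = []


# Usage: collect the "requests" in a list to see the batching behavior.
requests = []
batcher = SpanBatcher(export_fn=requests.append, max_batch_size=3)
for i in range(7):
    batcher.add_span(f"span-{i}")
batcher.flush()  # drain the remainder
# requests now holds three batches of sizes 3, 3, and 1
```

A max-payload-bytes condition could later be added alongside the count check in `add_span`, at the cost of serializing each span early to measure its size.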
mlflow/environment_variables.py
Outdated
```python
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
    "MLFLOW_ASYNC_TRACE_LOGGING_MAX_BATCH_SIZE", int, 1
)
```

Suggested change:

```python
MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE = _EnvironmentVariable(
    "MLFLOW_ASYNC_TRACE_LOGGING_MAX_SPAN_BATCH_SIZE", int, 1
)
```
to be consistent
thx for the catch!
```python
for span in spans:
    self._span_batcher.add_span(location=location, span=span)
```
Could we add an add_spans method instead (without the batch size env var)? Before this change all spans were exported in one request, but if we add_span one by one, won't there be a number of separate span requests?
Could you elaborate? Previously we exported the spans passed to export(spans) together; however, in practice that is just a single span, because span.end() directly triggers export without batching.
Ah yeah that's true...
Does supporting BatchSpanProcessor also work?
Yeah, I thought about it, but it's non-trivial. We have a lot of custom logic in the current span processors, so I struggled to adopt BatchSpanProcessor there. Also, we would need to move the trace logging logic from the exporter to our processors, because trace logging should not be batched together with span logging.
I see, thanks for the explanation!
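As a toy illustration of the request-count concern in the thread above (the helper names here are hypothetical, not the PR's code): exporting spans one at a time issues one request per span, while routing them through a size-k batcher issues roughly N/k requests.

```python
def export_unbatched(spans, send):
    # Before this PR: span.end() triggers export immediately,
    # so every span becomes its own export request.
    for span in spans:
        send([span])


def export_batched(spans, send, max_batch_size):
    # With batching: spans are sent max_batch_size at a time.
    for i in range(0, len(spans), max_batch_size):
        send(spans[i:i + max_batch_size])


sent_unbatched, sent_batched = [], []
spans = list(range(10))
export_unbatched(spans, sent_unbatched.append)
export_batched(spans, sent_batched.append, max_batch_size=4)
# 10 requests without batching vs 3 requests (4 + 4 + 2) with it
```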
What changes are proposed in this pull request?
For Zerobus traces, the TraceExport endpoint handles a batch of spans. However, we are not making use of that, since the spans passed to the exporter are exported immediately (via the async queue); namely, every span creates a new export request. This is highly inefficient and limits scalability.

This PR introduces queue-based batching to the UC exporter. It is built as a pluggable component on top of the current async queue, rather than a change to the existing queue itself, since that queue handles multiple types of requests, e.g., the StartTrace request and other exporters.
The default is no batching, to reduce blast radius, but it might actually be OK to change the default given that the product is in private preview (easier to change now than after public release). Will revisit with the team before merging.
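A minimal sketch of such queue-based batching, using only the standard library (the class and method names are hypothetical, not the PR's actual implementation): a background worker drains spans from a queue and flushes a batch when it reaches the configured size, when the queue goes idle, or at shutdown.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to shut down


class BatchingWorker:
    def __init__(self, export_fn, max_batch_size=100, idle_timeout_s=1.0):
        self._export_fn = export_fn
        self._max_batch_size = max_batch_size
        self._idle_timeout_s = idle_timeout_s
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add_span(self, span):
        self._queue.put(span)

    def shutdown(self):
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            try:
                item = self._queue.get(timeout=self._idle_timeout_s)
            except queue.Empty:
                # Queue went idle: flush whatever has accumulated.
                if batch:
                    self._export_fn(batch)
                    batch = []
                continue
            if item is _STOP:
                if batch:
                    self._export_fn(batch)
                return
            batch.append(item)
            if len(batch) >= self._max_batch_size:
                self._export_fn(batch)
                batch = []


# Usage: a long idle timeout keeps the example deterministic.
batches = []
worker = BatchingWorker(export_fn=batches.append, max_batch_size=5,
                        idle_timeout_s=60)
for i in range(12):
    worker.add_span(i)
worker.shutdown()
# batches: two full batches of 5, then the remainder of 2
```

The single consumer thread preserves span order, and the idle-timeout flush bounds how long a partially filled batch can sit in memory.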
How is this PR tested?
Does this PR require documentation update?
Definitely needs to be in the documentation once the feature becomes public.
Release Notes
Is this a user-facing change?
Support span batching when exporting spans to Databricks UC tables.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.