
Add tracing integration to Gateway API endpoints #20495

Merged
TomeHirata merged 13 commits into mlflow:master from TomeHirata:stack/gateway-trace-api
Feb 5, 2026

Conversation

Collaborator

@TomeHirata TomeHirata commented Feb 2, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

n/a

What changes are proposed in this pull request?

See the PR title.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@TomeHirata TomeHirata force-pushed the stack/gateway-trace-api branch from c85cd23 to f62f74c Compare February 2, 2026 02:59
@TomeHirata TomeHirata marked this pull request as ready for review February 2, 2026 03:10
Copilot AI review requested due to automatic review settings February 2, 2026 03:10
Contributor

github-actions bot commented Feb 2, 2026

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20495/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20495/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20495/merge

Contributor

github-actions bot commented Feb 2, 2026

Documentation preview for cabbc7c is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds first-class tracing / usage-tracking support for MLflow Gateway endpoints by persisting a tracing destination (experiment) on endpoints, instrumenting provider calls with spans (including streaming token usage where available), and exposing usage-tracking configuration in the Gateway UI.

Changes:

  • Add usage_tracking + experiment_id to Gateway endpoint persistence (DB schema, migrations, protos, entities, REST/SQL stores).
  • Instrument Gateway request handling and providers with MLflow tracing spans, including streamed token-usage extraction for select providers.
  • Extend UI and tests to configure usage tracking + select experiments and validate trace creation.
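One way to picture the streamed token-usage extraction mentioned above is a small wrapper around the provider's chunk stream. This is only a sketch under stated assumptions: `stream_with_usage`, `fake_provider_stream`, and the dict-shaped chunks are hypothetical stand-ins, not names from this PR, and a real implementation would presumably record the usage on a tracing span rather than call a plain callback.

```python
import asyncio
from typing import Any, AsyncIterator, Callable


async def stream_with_usage(
    chunks: AsyncIterator[dict[str, Any]],
    on_usage: Callable[[dict[str, int]], None],
) -> AsyncIterator[dict[str, Any]]:
    """Yield chunks unchanged and report the last usage payload seen."""
    last_usage = None
    async for chunk in chunks:
        # Many chunks carry usage=None; keep the most recent real value.
        if chunk.get("usage") is not None:
            last_usage = chunk["usage"]
        yield chunk
    # Only reached when the stream is fully consumed.
    if last_usage is not None:
        on_usage(last_usage)


async def fake_provider_stream() -> AsyncIterator[dict[str, Any]]:
    """Hypothetical provider stream: usage arrives only on the final chunk."""
    yield {"choices": [{"delta": {"content": "Hel"}}], "usage": None}
    yield {"choices": [{"delta": {"content": "lo"}}], "usage": None}
    yield {
        "choices": [],
        "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
    }


async def collect() -> tuple[str, list[dict[str, int]]]:
    recorded: list[dict[str, int]] = []
    text: list[str] = []
    async for chunk in stream_with_usage(fake_provider_stream(), recorded.append):
        for choice in chunk["choices"]:
            text.append(choice["delta"]["content"])
    return "".join(text), recorded


text, recorded = asyncio.run(collect())
```

Note that the callback fires only after the stream is exhausted, so a client that disconnects early would leave no usage recorded in this sketch.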

Reviewed changes

Copilot reviewed 47 out of 49 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/tracking/test_rest_tracking.py | Adds coverage for auto-creating an experiment when usage_tracking=True. |
| tests/store/tracking/test_rest_store.py | Updates REST store request expectations to include usage_tracking. |
| tests/store/tracking/test_gateway_sql_store.py | Extends SQL store endpoint creation test with usage-tracking fields. |
| tests/server/test_gateway_api.py | Updates gateway API tests for traced providers, streaming, and trace creation. |
| tests/resources/db/latest_schema.sql | Updates latest test DB schema with new endpoint columns. |
| tests/gateway/schemas/test_completions.py | Validates streaming completions schema with optional usage. |
| tests/gateway/schemas/test_chat.py | Validates streaming chat schema with optional usage. |
| tests/gateway/providers/test_tracing.py | Adds unit tests for provider tracing wrapper behavior. |
| tests/gateway/providers/test_togetherai.py | Updates TogetherAI streaming expectations to include usage: None / usage final chunk. |
| tests/gateway/providers/test_openai.py | Updates OpenAI streaming tests for stream_options.include_usage + optional usage field. |
| tests/gateway/providers/test_gemini.py | Updates Gemini streaming tests to include optional usage. |
| tests/gateway/providers/test_cohere.py | Updates Cohere streaming tests to include usage objects. |
| tests/gateway/providers/test_anthropic.py | Updates Anthropic streaming tests + validates usage extraction changes. |
| tests/db/schemas/sqlite.sql | Adds experiment_id and usage_tracking to SQLite test schema. |
| tests/db/schemas/postgresql.sql | Adds experiment_id and usage_tracking to Postgres test schema. |
| tests/db/schemas/mysql.sql | Adds experiment_id and usage_tracking to MySQL test schema. |
| tests/db/schemas/mssql.sql | Adds experiment_id and usage_tracking to MSSQL test schema. |
| mlflow/types/chat.py | Adds optional usage field to chat completion chunk type. |
| mlflow/tracing/constant.py | Adds gateway-related trace metadata keys + provider/model span attributes. |
| mlflow/store/tracking/gateway/sqlalchemy_mixin.py | Persists experiment_id + usage_tracking for endpoint create/update. |
| mlflow/store/tracking/gateway/rest_mixin.py | Extends REST store endpoint create/update payloads with usage-tracking fields. |
| mlflow/store/tracking/gateway/entities.py | Adds experiment_id to resolved endpoint config entity. |
| mlflow/store/tracking/gateway/config_resolver.py | Propagates experiment_id into runtime endpoint config. |
| mlflow/store/tracking/gateway/abstract_mixin.py | Updates abstract store contract/docs for usage tracking + experiment handling. |
| mlflow/store/tracking/dbmodels/models.py | Adds new endpoint columns to SQLAlchemy DB model and entity conversion. |
| mlflow/store/db_migrations/versions/d0e1f2a3b4c5_add_experiment_id_to_endpoints.py | Adds Alembic migration for new endpoint columns. |
| mlflow/server/js/src/lang/default/en.json | Adds UI strings for usage tracking + experiment selection + “View traces”. |
| mlflow/server/js/src/gateway/types.ts | Extends Gateway endpoint/request types with usage-tracking fields. |
| mlflow/server/js/src/gateway/pages/EndpointPage.tsx | Passes experiment id into edit form renderer to surface trace link. |
| mlflow/server/js/src/gateway/hooks/useExperimentsForSelect.ts | Adds query hook to fetch experiments for selection. |
| mlflow/server/js/src/gateway/hooks/useCreateEndpointForm.ts | Adds usage tracking + experiment selection to create-endpoint submit payload. |
| mlflow/server/js/src/gateway/components/endpoint-form/EndpointFormRenderer.tsx | Adds create-time usage tracking toggle + experiment selector UI. |
| mlflow/server/js/src/gateway/components/edit-endpoint/EditEndpointFormRenderer.tsx | Adds “View traces” link when endpoint has an experiment id. |
| mlflow/server/js/src/gateway/components/create-endpoint/ExperimentSelect.tsx | Adds experiment select component for the create form. |
| mlflow/server/handlers.py | Adds server-side experiment auto-creation logic for usage-tracked endpoints. |
| mlflow/server/gateway_api.py | Adds gateway span creation + traced streaming response wrapper; wraps providers for tracing. |
| mlflow/protos/service_pb2.pyi | Updates Python proto stubs for new endpoint fields. |
| mlflow/protos/service.proto | Adds experiment_id + usage_tracking to gateway endpoint create/update protos. |
| mlflow/gateway/utils.py | Updates SSE serialization to use model_dump_json() for pydantic v2. |
| mlflow/gateway/schemas/completions.py | Adds optional usage to streaming completions response schema. |
| mlflow/gateway/providers/tracing.py | Introduces TracingProviderWrapper for provider method span instrumentation. |
| mlflow/gateway/providers/openai.py | Adds streamed usage extraction + injects stream_options.include_usage. |
| mlflow/gateway/providers/litellm.py | Adds get_provider_name() override for more accurate tracing labels. |
| mlflow/gateway/providers/gemini.py | Adds streamed usage extraction from usageMetadata. |
| mlflow/gateway/providers/base.py | Adds get_provider_name() default to support tracing/metrics naming. |
| mlflow/gateway/providers/anthropic.py | Adds streamed usage aggregation across events and exposes it on chunks. |
| mlflow/entities/gateway_endpoint.py | Extends GatewayEndpoint entity (proto conversion) with new fields. |
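The OpenAI row above mentions injecting stream_options.include_usage so that streamed responses carry token usage. A minimal sketch of that idea follows, assuming a plain dict request payload; the helper name `with_stream_usage` is hypothetical and is not taken from the PR.

```python
from typing import Any


def with_stream_usage(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a payload that asks for usage on streamed responses.

    Non-streaming requests are returned untouched, since usage is already
    part of a regular (non-streamed) response body.
    """
    if not payload.get("stream", False):
        return payload
    # Merge with any caller-supplied stream options instead of replacing them.
    stream_options = {**(payload.get("stream_options") or {}), "include_usage": True}
    # Build a new dict so the caller's payload is not mutated.
    return {**payload, "stream_options": stream_options}


request = {"model": "gpt-4o-mini", "messages": [], "stream": True}
streaming_payload = with_stream_usage(request)
plain_payload = with_stream_usage({"model": "gpt-4o-mini", "messages": []})
```

This sketch always forces include_usage to True for streaming requests; whether a caller's explicit False should be respected is a design choice the thread does not settle.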

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@TomeHirata TomeHirata force-pushed the stack/gateway-trace-api branch 12 times, most recently from 0a1d073 to b18ba2f Compare February 3, 2026 07:21
@TomeHirata TomeHirata force-pushed the stack/gateway-trace-api branch from de9b744 to 8368e29 Compare February 4, 2026 05:17
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata added the team-review Trigger a team review request label Feb 5, 2026
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@github-actions github-actions bot added the rn/none List under Small Changes in Changelogs. label Feb 5, 2026
Member

harupy commented Feb 5, 2026

/review


🚀 Review workflow started


✅ Review completed.

Review Output

The code looks consistent with the implementation. Let me now finalize my review. After careful examination of all the changed lines in this PR, I have reviewed:

  1. mlflow/gateway/tracing_utils.py - New utility module
  2. mlflow/server/gateway_api.py - Integration changes
  3. tests/server/test_gateway_api.py - Test updates
  4. tests/telemetry/test_tracked_events.py - Test updates

After thorough review of the changed lines only, here's my assessment:

No issues found

The PR changes are well-structured:

  1. tracing_utils.py: The traced_gateway_call function correctly handles the case when tracing is disabled (no experiment_id) by returning the original function unchanged, and properly wraps it with mlflow.trace when enabled.

  2. gateway_api.py: The changes consistently update all endpoint handlers to:

    • Unpack the new tuple return from _create_provider_from_endpoint_name
    • Use traced_gateway_call to wrap provider methods
    • Replace the removed make_streaming_response with inline StreamingResponse using to_sse_chunk

  3. Test updates: Tests properly mock the new tuple return value (mock_provider, mock_endpoint_config) and update streaming test assertions to work with the new implementation.

  4. The import changes: make_streaming_response is removed and replaced with to_sse_chunk and inline StreamingResponse, which is cleaner.

  5. The change from body.get("stream") to body.get("stream", False) is a minor improvement that makes the default explicit.
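The conditional wrapping described in item 1 can be sketched roughly as follows. Everything here is an assumption made for illustration: `EndpointConfigSketch` stands in for GatewayEndpointConfig, `fake_trace` stands in for mlflow.trace so the snippet runs without MLflow installed, and `maybe_traced_call_sketch` is not the PR's actual function or signature. Real provider methods are likely async; a sync function is used for brevity.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Stand-in for a trace backend: (experiment_id, span_name) per traced call.
TRACE_LOG: list[tuple[str, str]] = []


@dataclass
class EndpointConfigSketch:
    """Hypothetical stand-in for GatewayEndpointConfig."""

    endpoint_name: str
    experiment_id: Optional[str] = None


def fake_trace(name: str, experiment_id: str) -> Callable:
    """Stand-in for mlflow.trace: records the call instead of emitting a span."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            TRACE_LOG.append((experiment_id, name))
            return func(*args, **kwargs)

        return wrapper

    return decorator


def maybe_traced_call_sketch(
    func: Callable[..., Any], config: EndpointConfigSketch
) -> Callable[..., Any]:
    """Return func unchanged when tracing is off, else a traced wrapper."""
    if config.experiment_id is None:
        return func  # tracing disabled: no wrapper, no overhead
    span_name = f"gateway/{config.endpoint_name}"
    return fake_trace(name=span_name, experiment_id=config.experiment_id)(func)


def chat(prompt: str) -> str:
    return prompt.upper()


untraced = maybe_traced_call_sketch(chat, EndpointConfigSketch("chat-endpoint"))
traced = maybe_traced_call_sketch(
    chat, EndpointConfigSketch("chat-endpoint", experiment_id="42")
)
```

Returning the original callable when no experiment is configured keeps the untraced path identical to the pre-PR behaviour, which is presumably why the review highlights it.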

The code follows consistent patterns, has proper type hints, and the tests comprehensively cover the new tracing functionality including both synchronous and streaming cases.

No issues found

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from harupy February 5, 2026 04:55
endpoint_type: EndpointType,
enable_tracing: bool = True,
- ) -> BaseProvider:
+ ) -> tuple[BaseProvider, GatewayEndpointConfig]:
Collaborator

#20495 (comment)

It is surprising that we have both EndpointConfig and GatewayEndpointConfig; it's confusing. I feel we should rename EndpointConfig to ProviderConfig, though that is not related to this PR.

Could we rename this function, though, and probably use a NamedTuple for the return value to be more explicit?
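For illustration, the NamedTuple suggestion could look something like the sketch below. All names (`ProviderSketch`, `EndpointConfigSketch`, `ProviderWithConfig`, `create_provider_sketch`) are hypothetical stand-ins, not the types or function used in the PR.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


@dataclass
class ProviderSketch:
    """Hypothetical stand-in for BaseProvider."""

    name: str


@dataclass
class EndpointConfigSketch:
    """Hypothetical stand-in for GatewayEndpointConfig."""

    endpoint_name: str
    experiment_id: Optional[str] = None


class ProviderWithConfig(NamedTuple):
    """Named fields make the two-element return value self-describing."""

    provider: ProviderSketch
    endpoint_config: EndpointConfigSketch


def create_provider_sketch(endpoint_name: str) -> ProviderWithConfig:
    # Fixed values keep the sketch self-contained; real code would look these up.
    return ProviderWithConfig(
        provider=ProviderSketch("openai"),
        endpoint_config=EndpointConfigSketch(endpoint_name, experiment_id="42"),
    )


result = create_provider_sketch("chat-endpoint")
# Positional unpacking keeps working, so existing call sites need no change.
provider, endpoint_config = result
```

Because a NamedTuple is still a tuple, tests that mock the return value as a plain (mock_provider, mock_endpoint_config) pair would continue to unpack correctly.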

Collaborator Author

Yeah, I think we'll consolidate them once we deprecate or integrate the legacy gateway workflow.

from mlflow.store.tracking.gateway.entities import GatewayEndpointConfig


def traced_gateway_call(
Collaborator

Nit on this function name: since the call may or may not be traced depending on the config, maybe something like apply_gateway_tracing_config?

Collaborator Author

maybe_traced_gateway_call?

Collaborator

@serena-ruan serena-ruan left a comment

Overall LGTM! Left some nits :)

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata enabled auto-merge February 5, 2026 07:32
@TomeHirata TomeHirata added this pull request to the merge queue Feb 5, 2026
Merged via the queue into mlflow:master with commit 4eb654b Feb 5, 2026
52 checks passed
@TomeHirata TomeHirata deleted the stack/gateway-trace-api branch February 5, 2026 08:06
Labels

rn/none List under Small Changes in Changelogs. team-review Trigger a team review request

4 participants