Skip to content

[Endpoints] [4/x] Abstract store interface#19005

Merged
BenWilson2 merged 2 commits intomlflow:masterfrom
BenWilson2:stack/endpoints/abstract
Dec 5, 2025
Merged

[Endpoints] [4/x] Abstract store interface#19005
BenWilson2 merged 2 commits intomlflow:masterfrom
BenWilson2:stack/endpoints/abstract

Conversation

@BenWilson2
Copy link
Member

@BenWilson2 BenWilson2 commented Nov 24, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Add the abstract store interface for the endpoints and secrets management APIs

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions
Copy link
Contributor

github-actions bot commented Nov 26, 2025

Documentation preview for ad08499 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/abstract branch 2 times, most recently from 9f711e6 to 62710ce Compare December 3, 2025 01:52
raise NotImplementedError(self.__class__.__name__)

# ===================================================================================
# Model Definitions Management
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "model" is over-loaded. Is there any way to make it clear these APIs are for gateway models? Maybe we can define a mixin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I think all of the stores should have mixins for them in order to isolate these and make it even more clear for us that these are UI-only endpoints / methods. I'll refactor to use mixins for the ABC, sql store, and rest store.

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/abstract branch 3 times, most recently from 3a572c2 to 45dcd4f Compare December 4, 2025 01:54
Copilot AI review requested due to automatic review settings December 4, 2025 01:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces abstract store interface methods for secrets management and endpoint configuration as part of the MLflow Gateway/LLM endpoints feature. It provides comprehensive cryptographic infrastructure using envelope encryption (KEK/DEK pattern) following OWASP security best practices.

Key changes:

  • Adds cryptography utilities with AES-256-GCM encryption and PBKDF2 key derivation
  • Defines entity models for secrets, endpoints, model definitions, and bindings
  • Extends abstract store interface with 15 new methods for secrets and endpoints management

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
mlflow/utils/cryptography.py Core cryptographic utilities implementing envelope encryption with KEK/DEK pattern, following OWASP guidelines (600k PBKDF2 iterations)
mlflow/entities/gateway_secrets.py GatewaySecret entity for encrypted API key storage
mlflow/entities/gateway_endpoint.py Entity models for endpoints, model definitions, bindings, and runtime configurations
mlflow/entities/init.py Exports new gateway-related entities
mlflow/store/tracking/abstract_store.py Abstract interface methods for CRUD operations on secrets, model definitions, endpoints, and bindings
mlflow/cli/cryptography.py CLI command for KEK rotation with safety confirmations and progress reporting
mlflow/cli/init.py Registers crypto CLI group with optional import handling
tests/utils/test_cryptography.py Comprehensive unit tests for cryptographic functions (606 lines)
tests/entities/test_gateway_secrets.py Unit tests for GatewaySecret entity
tests/entities/test_gateway_endpoint.py Unit tests for endpoint-related entities (423 lines)
tests/cli/test_cryptography.py Extensive CLI tests covering success/failure scenarios (443 lines)
docs/api_reference/api_inventory.txt Adds new entities to API documentation inventory

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/abstract branch 2 times, most recently from 49a8458 to 903c1e8 Compare December 4, 2025 23:40
from mlflow.store.tracking.gateway.entities import GatewayEndpointConfig


class GatewayStoreMixin:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we split these methods as a Mixin because tracking store is getting heavy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly to separate out the ambiguous terms "model" (in MLflow that means something different in our public APIs), but also, yeah, sqlalchemy_store.py is the Titanic of Python modules

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, we should probably do some refactoring of tracking stores using mixin

"""
raise NotImplementedError(self.__class__.__name__)

def get_secret_info(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the resource name consistent?

Suggested change
def get_secret_info(
def get_secret(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I selected the 'info' part simply because I wanted to convey that this returns the metadata and not the actual secret. I'm not convinced that it's needed... curious to hear your thoughts on my reasoning (perhaps I'm just being too paranoid ;) )

Copy link
Collaborator

@TomeHirata TomeHirata Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I proposed get_secret in order to be consistent about the terminology. Since we return GatewaySecret, the method should be called get_secret or get_gateway_secrete. If the GatewaySecret entity does not include actual secret, GatewaySecret might be a misleading name. So I'd call it GatewaySecretInfo. What do you think?

Copy link
Collaborator

@TomeHirata TomeHirata Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we call this method get_secret_info, then can we change list_secrets into list_secret_info to be consistent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call! Since that is metadata as well :)

self,
resource_type: str,
resource_id: str,
) -> GatewayEndpointConfig:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is endpoint uniquely identified by resource_type and resource_id? If the composite key (endpoint_id, resource_type, resource_id) is unique, should the returned value list[GatewayEndpointConfig] or list[GatewayEndpoint]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great point. For the initial version I've been focused on the concept of multiple resources -> single endpoint. But there's no reason to restrict this since the DB model supports this without any issues.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, can we return multiple endpoints here to be future proof? We might want to have multiple endpoints attached to a single resource in the future. We can add the single restriction in the online-scorer specific logic in the downstream if needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep making the change is in progress right now ;)

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/abstract branch from 903c1e8 to 5bdfd7f Compare December 5, 2025 04:29
Copy link
Collaborator

@B-Step62 B-Step62 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, #19005 (comment) is not blocking but I imagine it would be cleaner to move the method outside the store if it has special access requirement.

"""
raise NotImplementedError(self.__class__.__name__)

def get_resource_endpoint_configs(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to downstream invoke this API? Usually store methods are invoked through MlflowClient, but in this case, it is a bit weird because we don't want anything outside server to call this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah having it isolated and not even in the store implementation makes sense. I put the entities that back the responses here into the store layer, so this seems like a logical location to consolidate that access

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/abstract branch from 5bdfd7f to ad08499 Compare December 5, 2025 06:50
@BenWilson2 BenWilson2 added this pull request to the merge queue Dec 5, 2025
Merged via the queue into mlflow:master with commit a6167b1 Dec 5, 2025
50 checks passed
@BenWilson2 BenWilson2 deleted the stack/endpoints/abstract branch December 5, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. team-review Trigger a team review request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants