
[Endpoints] [8/x] Add credential cache#19014

Merged
BenWilson2 merged 1 commit into mlflow:master from BenWilson2:stack/endpoints/cache
Dec 9, 2025

Conversation

@BenWilson2
Member

@BenWilson2 BenWilson2 commented Nov 24, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Add the credential cache layer to protect the DB against high-volume server requests
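For context, a minimal sketch of what such a cache layer in front of a backing-store lookup might look like. The class name, method names, and defaults below are illustrative assumptions, not the PR's actual implementation:

```python
import threading
import time


class CredentialCache:
    """Illustrative TTL cache guarding a backing-store lookup (sketch only)."""

    def __init__(self, fetch_fn, ttl_seconds=60, max_size=1000):
        self._fetch_fn = fetch_fn      # e.g. a DB query for the secret
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._entries = {}             # key -> (expires_at, value)
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]        # cache hit within TTL
        value = self._fetch_fn(key)    # miss: hit the backing store
        with self._lock:
            if len(self._entries) >= self._max_size:
                # Evict the oldest-inserted entry to bound memory
                self._entries.pop(next(iter(self._entries)))
            self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self):
        with self._lock:
            self._entries.clear()      # full invalidation on any mutation
```

High-volume reads within the TTL are served from memory; only misses and post-invalidation reads reach the database.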

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@BenWilson2 BenWilson2 changed the title add secure cache layer [Endpoints] [8/x] Add credential cache Nov 24, 2025
@BenWilson2 BenWilson2 marked this pull request as ready for review November 24, 2025 22:04
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from af09243 to 8f2a8c9 Compare November 24, 2025 22:19
@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Nov 24, 2025
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from a94d442 to db7722b Compare November 26, 2025 03:40
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from db7722b to 07e10ca Compare November 27, 2025 06:31
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from bac5b39 to 3ea4110 Compare December 2, 2025 02:14
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from 3f2d368 to 133bfe7 Compare December 2, 2025 22:49
@github-actions
Contributor

github-actions bot commented Dec 2, 2025

Documentation preview for 5187652 is available at:


@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 3 times, most recently from d680b39 to f0df939 Compare December 4, 2025 00:27
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 9 times, most recently from 154a77e to 9447a37 Compare December 7, 2025 01:57

from mlflow.utils.crypto import decrypt_with_aes_gcm, encrypt_with_aes_gcm

MIN_TTL = 10
Collaborator

nit: can we add a "_" prefix to these internal consts?

Comment on lines +69 to +72
SECRETS_CACHE_TTL_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_TTL"

# Maximum number of entries in server-side secrets cache (default 1000 entries)
SECRETS_CACHE_MAX_SIZE_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_MAX_SIZE"
Collaborator

It is expected that users configure these variables, so technically these variables are not "internal", correct? Then shall we remove the "_" prefix from the env var names?

session.flush()
session.refresh(sql_secret)

self._invalidate_secret_cache()
Collaborator

Do we want to invalidate all cached records every time? What is the downside of per-key invalidation?

Member Author

Mostly complexity. Handling per-record TTL and invalidating only the mutated key adds complexity for an action that is (hopefully) so infrequent that the added complexity isn't worth it.

Collaborator

Makes sense; let's start with complete invalidation and revisit if we hit performance issues.

Comment on lines +74 to +76
def _derive_bucket_key(self, time_bucket: int) -> bytes:
bucket_bytes = time_bucket.to_bytes(8, byteorder="big")
return hashlib.sha256(self._base_key + bucket_bytes).digest()
Collaborator

@TomeHirata TomeHirata Dec 8, 2025

[sorry if I misunderstand the time-bucketed ephemeral encryption flow]
With the current secret derivation I wonder if it's not ensured that secrets in time-bucketed cache are "permanently unrecoverable after a short time window even if an attacker gains access to the base key via a memory dump".
Let's say a secret is stored in the cache at 00:00:00 and ttl_seconds=60, and then attackers dump the memory at 00:05:00. Attackers figure out the base key and encrypted secret strings at 00:10:00. They can compute bucket key by trying the time bucket number one by one (10 -> 9 -> ... -> 0) and find the time bucket key that can decrypt the secrets relatively efficiently:

enc = EphemeralCacheEncryption()
enc._base_key = "extracted_base_key"
time_bucket = enc._get_time_bucket()
while True:
    try:
        secret = enc.decrypt("extracted_encrypted_bytes", time_bucket)
        break
    except Exception:
        time_bucket -= 1

If my statement above is true, probably we should generate a time bucket key in a way that cannot be computed from base_key and time bucket, and then have a separate thread to remove the time bucket secret from memory if it's expired.

Member Author

This is a good point on the ease of cracking. While we definitely can't make this completely iron-clad (the only way to do this is through hardware-based encryption where the key is generated and decrypted external to memory such that the CPU never receives or processes the decryption), we can make this an order of magnitude harder to crack by using a random value that is purged in a separate process when the TTL triggers.
I'm going to go with that approach and also list out the limitations of what we're doing here (for what it's worth, other OSS implementations use even lower security grades to the time window encryption; going with the random encryption value upgrades this to well beyond anything else that I can find).
The case now is, at worst, with someone who has root access to the tracking server, they could theoretically get cached keys that are present in the cache within the time window iff they find them in the memory dump AND they can figure out which bytearray represents the random decryption key. This now effectively becomes finding the right two needles in a stack of needles (not impossible, but highly implausible that anyone would want to spend weeks going through bytearray permutations to find 2 bytearrays in a sea of bytearrays to satisfy decryption).
This satisfies the defense in depth requirements for Mitre's posting (reference included in the docstrings).
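The random-key approach being described could look roughly like this (an assumed shape; `RandomBucketKeys` and its method names are illustrative). Because each bucket key comes from `os.urandom`, it cannot be recomputed from any base key once it is purged from memory:

```python
import os
import threading


class RandomBucketKeys:
    """Per-bucket random keys, permanently unrecoverable after purge (sketch)."""

    def __init__(self):
        self._keys = {}               # time bucket -> random 32-byte key
        self._lock = threading.Lock()

    def key_for(self, bucket):
        with self._lock:
            if bucket not in self._keys:
                # Random key: not derivable from a base secret + bucket number
                self._keys[bucket] = os.urandom(32)
            return self._keys[bucket]

    def purge(self, current_bucket):
        # Drop keys more than one bucket old; once gone, ciphertexts
        # encrypted under them are permanently undecryptable.
        with self._lock:
            for b in [b for b in self._keys if abs(current_bucket - b) > 1]:
                del self._keys[b]
```

Unlike the derived-key scheme, an attacker with a memory dump taken after the purge has nothing to enumerate: the key material no longer exists.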

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from 0ed7d02 to e26b88a Compare December 8, 2025 21:49
@BenWilson2 BenWilson2 requested a review from TomeHirata December 8, 2025 21:49
current_bucket = self._get_time_bucket()

# Purge expired bucket keys (more than 1 bucket old)
expired = [b for b in self._bucket_keys if abs(current_bucket - b) > 1]
Collaborator

Shouldn't we have a separate thread that continuously cleans up the expired bucket keys? Otherwise expired keys stay in memory until encrypt or decrypt is called again (probably not a huge blocker, but just wondered).

Member Author

Totally - sorry for not including the separate thread in the last push (I forgot to push the stack early last night, which had this change included). It's critical to have the timed daemon thread in case the cache isn't hit after some time period; otherwise the cache could stay warm far past the TTL.
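The timed daemon thread being described could be sketched like this (`purge_fn` and the interval are placeholders, not the PR's actual names):

```python
import threading


def start_purge_daemon(purge_fn, interval_seconds=30):
    """Run purge_fn immediately, then on every interval, via daemon timers."""
    def _loop():
        purge_fn()  # drop expired bucket keys even if the cache is idle
        timer = threading.Timer(interval_seconds, _loop)
        timer.daemon = True  # don't block interpreter shutdown
        timer.start()
    _loop()
```

The daemon flag matters: the purge loop should never keep the server process alive on its own.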

del self._bucket_keys[b]

# For decryption of slightly old buckets, return existing key if available
if time_bucket in self._bucket_keys:
Collaborator

nit: If we store up to one bucket key, we can have active_bucket: int and active_bucket_key: bytes separately instead of dict[int, bytes]

Member Author

great call - I'll create the typed entries since we only have 4 values by design.

Collaborator

@TomeHirata TomeHirata left a comment

Left a suggestion on a potentially risky case; otherwise LGTM

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from e26b88a to 5187652 Compare December 9, 2025 20:26
@BenWilson2 BenWilson2 added this pull request to the merge queue Dec 9, 2025
Merged via the queue into mlflow:master with commit e95a16b Dec 9, 2025
69 of 71 checks passed
@BenWilson2 BenWilson2 deleted the stack/endpoints/cache branch December 9, 2025 21:55

Labels

area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. team-review Trigger a team review request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants