
[Endpoints] [8/x] Add credential cache#19014

Merged
BenWilson2 merged 1 commit into mlflow:master from BenWilson2:stack/endpoints/cache
Dec 9, 2025

Conversation

@BenWilson2
Member

@BenWilson2 BenWilson2 commented Nov 24, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Add the credential cache layer to protect the DB against high-volume server requests
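For context, a minimal sketch of what such a cache layer in front of a backing-store lookup might look like. The class name, method names, and defaults below are illustrative assumptions, not the PR's actual implementation:

```python
import threading
import time


class CredentialCache:
    """Illustrative TTL cache guarding a backing-store lookup (sketch only)."""

    def __init__(self, fetch_fn, ttl_seconds=60, max_size=1000):
        self._fetch_fn = fetch_fn      # e.g. a DB query for the secret
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._entries = {}             # key -> (expires_at, value)
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]        # cache hit within TTL
        value = self._fetch_fn(key)    # miss: hit the backing store
        with self._lock:
            if len(self._entries) >= self._max_size:
                # Evict the oldest-inserted entry to bound memory
                self._entries.pop(next(iter(self._entries)))
            self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self):
        with self._lock:
            self._entries.clear()      # full invalidation on any mutation
```

High-volume reads within the TTL are served from memory; only misses and post-invalidation reads reach the database.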

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@BenWilson2 BenWilson2 changed the title add secure cache layer [Endpoints] [8/x] Add credential cache Nov 24, 2025
@BenWilson2 BenWilson2 marked this pull request as ready for review November 24, 2025 22:04
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from af09243 to 8f2a8c9 Compare November 24, 2025 22:19
@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Nov 24, 2025
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from a94d442 to db7722b Compare November 26, 2025 03:40
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from db7722b to 07e10ca Compare November 27, 2025 06:31
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from bac5b39 to 3ea4110 Compare December 2, 2025 02:14
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from 3f2d368 to 133bfe7 Compare December 2, 2025 22:49
@github-actions
Contributor

github-actions bot commented Dec 2, 2025

Documentation preview for 5187652 is available at:


@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 3 times, most recently from d680b39 to f0df939 Compare December 4, 2025 00:27
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 9 times, most recently from 154a77e to 9447a37 Compare December 7, 2025 01:57

from mlflow.utils.crypto import decrypt_with_aes_gcm, encrypt_with_aes_gcm

MIN_TTL = 10
Collaborator

nit: can we add a "_" prefix to these internal consts?

Comment on lines +69 to +72
SECRETS_CACHE_TTL_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_TTL"

# Maximum number of entries in server-side secrets cache (default 1000 entries)
SECRETS_CACHE_MAX_SIZE_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_MAX_SIZE"
Collaborator

It is expected that users configure these variables, so technically these variables are not "internal", correct? Then shall we remove the "_" prefix from the env var names?

session.flush()
session.refresh(sql_secret)

self._invalidate_secret_cache()
Collaborator

Do we want to invalidate all cached records every time? What is the downside of per-key invalidation?

Member Author

Mostly complexity. Handling per-record TTL and invalidating only the mutated key adds complexity for an action that is (hopefully) so infrequent that the added complexity isn't worth it.

Collaborator

Makes sense; let's start with complete invalidation and revisit if we hit performance issues.

Comment on lines +74 to +76
def _derive_bucket_key(self, time_bucket: int) -> bytes:
bucket_bytes = time_bucket.to_bytes(8, byteorder="big")
return hashlib.sha256(self._base_key + bucket_bytes).digest()
Collaborator

@TomeHirata TomeHirata Dec 8, 2025

[sorry if I misunderstand the time-bucketed ephemeral encryption flow]
With the current secret derivation I wonder if it's not ensured that secrets in time-bucketed cache are "permanently unrecoverable after a short time window even if an attacker gains access to the base key via a memory dump".
Let's say a secret is stored in the cache at 00:00:00 and ttl_seconds=60, and then attackers dump the memory at 00:05:00. Attackers figure out the base key and encrypted secret strings at 00:10:00. They can compute bucket key by trying the time bucket number one by one (10 -> 9 -> ... -> 0) and find the time bucket key that can decrypt the secrets relatively efficiently:

enc = EphemeralCacheEncryption()
enc._base_key = "extracted_base_key"
time_bucket = enc._get_time_bucket()
while True:
    try:
        secret = enc.decrypt("extracted_encrypted_bytes", time_bucket)
        break
    except Exception:
        time_bucket -= 1

If my statement above is true, probably we should generate a time bucket key in a way that cannot be computed from base_key and time bucket, and then have a separate thread to remove the time bucket secret from memory if it's expired.

Member Author

This is a good point on the ease of cracking. While we definitely can't make this completely iron-clad (the only way to do this is through hardware-based encryption where the key is generated and decrypted external to memory such that the CPU never receives or processes the decryption), we can make this an order of magnitude harder to crack by using a random value that is purged in a separate process when the TTL triggers.
I'm going to go with that approach and also list out the limitations of what we're doing here (for what it's worth, other OSS implementations use even lower security grades to the time window encryption; going with the random encryption value upgrades this to well beyond anything else that I can find).
The case now is, at worst, with someone who has root access to the tracking server, they could theoretically get cached keys that are present in the cache within the time window iff they find them in the memory dump AND they can figure out which bytearray represents the random decryption key. This now effectively becomes finding the right two needles in a stack of needles (not impossible, but highly implausible that anyone would want to spend weeks going through bytearray permutations to find 2 bytearrays in a sea of bytearrays to satisfy decryption).
This satisfies the defense in depth requirements for Mitre's posting (reference included in the docstrings).
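The random-key approach being described could look roughly like this (an assumed shape; `RandomBucketKeys` and its method names are illustrative). Because each bucket key comes from `os.urandom`, it cannot be recomputed from any base key once it is purged from memory:

```python
import os
import threading


class RandomBucketKeys:
    """Per-bucket random keys, permanently unrecoverable after purge (sketch)."""

    def __init__(self):
        self._keys = {}               # time bucket -> random 32-byte key
        self._lock = threading.Lock()

    def key_for(self, bucket):
        with self._lock:
            if bucket not in self._keys:
                # Random key: not derivable from a base secret + bucket number
                self._keys[bucket] = os.urandom(32)
            return self._keys[bucket]

    def purge(self, current_bucket):
        # Drop keys more than one bucket old; once gone, ciphertexts
        # encrypted under them are permanently undecryptable.
        with self._lock:
            for b in [b for b in self._keys if abs(current_bucket - b) > 1]:
                del self._keys[b]
```

Unlike the derived-key scheme, an attacker with a memory dump taken after the purge has nothing to enumerate: the key material no longer exists.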

@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch 2 times, most recently from 0ed7d02 to e26b88a Compare December 8, 2025 21:49
@BenWilson2 BenWilson2 requested a review from TomeHirata December 8, 2025 21:49
current_bucket = self._get_time_bucket()

# Purge expired bucket keys (more than 1 bucket old)
expired = [b for b in self._bucket_keys if abs(current_bucket - b) > 1]
Collaborator

Shouldn't we have a separate thread that continuously cleans up the expired bucket keys? Otherwise expired keys stay in memory until encrypt or decrypt is called again (probably not a huge blocker, but just wondered).

Member Author

Totally - sorry for not including the separate thread in the last push (I forgot to push the stack early last night, which had this change included). It's critical to have the timed daemon thread in case the cache isn't hit after some time period; otherwise the cache could stay warm far past the TTL.
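The timed daemon thread being described could be sketched like this (`purge_fn` and the interval are placeholders, not the PR's actual names):

```python
import threading


def start_purge_daemon(purge_fn, interval_seconds=30):
    """Run purge_fn immediately, then on every interval, via daemon timers."""
    def _loop():
        purge_fn()  # drop expired bucket keys even if the cache is idle
        timer = threading.Timer(interval_seconds, _loop)
        timer.daemon = True  # don't block interpreter shutdown
        timer.start()
    _loop()
```

The daemon flag matters: the purge loop should never keep the server process alive on its own.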

del self._bucket_keys[b]

# For decryption of slightly old buckets, return existing key if available
if time_bucket in self._bucket_keys:
Collaborator

nit: If we store up to one bucket key, we can have active_bucket: int and active_bucket_key: bytes separately instead of dict[int, bytes]

Member Author

great call - I'll create the typed entries since we only have 4 values by design.

Collaborator

@TomeHirata TomeHirata left a comment

Left a suggestion on a potentially risky case; otherwise LGTM

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 force-pushed the stack/endpoints/cache branch from e26b88a to 5187652 Compare December 9, 2025 20:26
@BenWilson2 BenWilson2 added this pull request to the merge queue Dec 9, 2025
Merged via the queue into mlflow:master with commit e95a16b Dec 9, 2025
69 of 71 checks passed
@BenWilson2 BenWilson2 deleted the stack/endpoints/cache branch December 9, 2025 21:55

Labels

area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. team-review Trigger a team review request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants