[Endpoints] [8/x] Add credential cache#19014
Conversation
Documentation preview for 5187652 is available.
```python
from mlflow.utils.crypto import decrypt_with_aes_gcm, encrypt_with_aes_gcm

MIN_TTL = 10
```
nit: can we add "_" prefix for these internal consts?
mlflow/server/constants.py
```python
SECRETS_CACHE_TTL_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_TTL"

# Maximum number of entries in server-side secrets cache (default 1000 entries)
SECRETS_CACHE_MAX_SIZE_ENV_VAR = "_MLFLOW_SERVER_SECRETS_CACHE_MAX_SIZE"
```
It is expected that users will configure these variables, so technically they are not "internal", correct? If so, shall we remove the "_" prefix from the env var names?
```python
session.flush()
session.refresh(sql_secret)

self._invalidate_secret_cache()
```
Do we want to invalidate all cached records every time? What is the downside of per-key invalidation?
Mostly complexity. Handling per-record TTL and invalidating only the mutated key adds complexity for an action that is (hopefully) so infrequent that the complexity isn't worth it.
Makes sense, let's start with the complete invalidation and then revisit if we hit performance issues
```python
def _derive_bucket_key(self, time_bucket: int) -> bytes:
    bucket_bytes = time_bucket.to_bytes(8, byteorder="big")
    return hashlib.sha256(self._base_key + bucket_bytes).digest()
```
[sorry if I misunderstand the time-bucketed ephemeral encryption flow]
With the current key derivation, I wonder whether it is actually guaranteed that secrets in the time-bucketed cache are "permanently unrecoverable after a short time window even if an attacker gains access to the base key via a memory dump".
Let's say a secret is stored in the cache at 00:00:00 with ttl_seconds=60, and attackers dump the memory at 00:05:00. They figure out the base key and the encrypted secret strings at 00:10:00. They can then recover the bucket key relatively efficiently by trying time bucket numbers one by one (10 -> 9 -> ... -> 0) until one decrypts the secrets:

```python
enc = EphemeralCacheEncryption()
enc._base_key = extracted_base_key  # recovered from the memory dump
time_bucket = enc._get_time_bucket()
while True:
    try:
        secret = enc.decrypt(extracted_encrypted_bytes, time_bucket)
        break
    except Exception:
        time_bucket -= 1
```

If my statement above is true, probably we should generate the time bucket key in a way that cannot be computed from base_key and the time bucket, and then have a separate thread remove the time bucket key from memory once it expires.
This is a good point on the ease of cracking. We definitely can't make this completely iron-clad (the only way to do that is hardware-based encryption, where the key is generated and applied outside of main memory so the CPU never receives or processes the raw decryption key), but we can make this an order of magnitude harder to crack by using a random value that is purged by a separate process when the TTL triggers.
I'm going to go with that approach and also list out the limitations of what we're doing here. (For what it's worth, other OSS implementations use even lower security grades for their time-window encryption; going with the random encryption value upgrades this well beyond anything else I can find.)
The worst case now is that someone with root access to the tracking server could theoretically recover cached keys present in the cache within the time window, iff they find them in the memory dump AND they can figure out which bytearray represents the random decryption key. This effectively becomes finding the right two needles in a stack of needles: not impossible, but it's highly implausible that anyone would spend weeks going through bytearray permutations to find 2 bytearrays in a sea of bytearrays to satisfy decryption.
This satisfies the defense in depth requirements for Mitre's posting (reference included in the docstrings).
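The random-key variant described above might be sketched as follows (all names here are hypothetical): each bucket key is drawn from `os.urandom` instead of being derived from the base key, so once a key is purged it cannot be recomputed from anything left in a memory dump.

```python
import os
import time

BUCKET_SECONDS = 60  # illustrative bucket width, not the PR's actual value


class EphemeralBucketKeys:
    """Random, non-derivable per-bucket keys for ephemeral cache encryption."""

    def __init__(self):
        self._bucket_keys: dict[int, bytes] = {}

    def _current_bucket(self) -> int:
        return int(time.time()) // BUCKET_SECONDS

    def key_for(self, bucket: int) -> bytes:
        # Random key: there is no formula linking it to a base key, so
        # after purging, the key material is permanently gone.
        if bucket not in self._bucket_keys:
            self._bucket_keys[bucket] = os.urandom(32)
        return self._bucket_keys[bucket]

    def purge_expired(self) -> None:
        # Drop keys more than one bucket old (mirrors the diff below).
        current = self._current_bucket()
        for b in [b for b in self._bucket_keys if abs(current - b) > 1]:
            del self._bucket_keys[b]
```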
```python
current_bucket = self._get_time_bucket()

# Purge expired bucket keys (more than 1 bucket old)
expired = [b for b in self._bucket_keys if abs(current_bucket - b) > 1]
```
Shouldn't we have a separate thread that continuously cleans up the expired bucket keys? Otherwise expired keys stay in memory until encrypt or decrypt is called again (probably not a huge blocker, but just wondering).
Totally - sorry for not including the separate thread in the last push (I forgot to push the stack early last night, which had this change included). It's critical to have the timed daemon thread in case the cache isn't hit for some period; otherwise the cache could stay warm far past the TTL.
```python
del self._bucket_keys[b]

# For decryption of slightly old buckets, return existing key if available
if time_bucket in self._bucket_keys:
```
nit: If we store at most one bucket key at a time, we can have `active_bucket: int` and `active_bucket_key: bytes` as separate fields instead of a `dict[int, bytes]`
Great call - I'll create the typed entries, since we only have 4 values by design.
TomeHirata left a comment:
Left a suggestion on a potential risky case, otherwise LGTM
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
🥞 Stacked PR
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Add the credential cache layer to protect the DB against high-volume server requests
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.