[ML] Adding missing endpoint information to cluster state#138934
[ML] Adding missing endpoint information to cluster state#138934jonathan-buttner merged 8 commits intoelastic:mainfrom
Conversation
DonalEvans
left a comment
There was a problem hiding this comment.
No major changes, just a couple of small test things.
One thing I'm not clear on is that if the index already contains the endpoint (which is what causes the VersionConflictEngineException), does what's in the index need to be updated in any way to match what now gets stored in the cluster state? Or is it guaranteed that what's in the indices will match what ends up in the cluster state?
.../internalClusterTest/java/org/elasticsearch/xpack/inference/integration/ModelRegistryIT.java
Outdated
Show resolved
Hide resolved
.../internalClusterTest/java/org/elasticsearch/xpack/inference/integration/ModelRegistryIT.java
Show resolved
Hide resolved
.../internalClusterTest/java/org/elasticsearch/xpack/inference/integration/ModelRegistryIT.java
Outdated
Show resolved
Hide resolved
The logic should be retrieving what's in the index and storing it in the cluster state. If it's not doing that let me know haha.
Yeah so I'm intentionally not using the contents of what the authorization endpoint is sending us. I'm 99% sure they'll be the same but it's best to use whatever was originally stored in the index that way it's consistent. In the future if we need to migrate the indices we'll need to write logic to handle that specifically. It'll be more complicated because we'll have to update the indices. |
|
Pinging @elastic/ml-core (Team:ML) |
You're right, I think I lost track of where stuff was coming from what with all the listeners. One day I'll be able to understand code that uses them properly the first time I look at it. I hope. Out of curiosity, would we have any way of knowing if the authorization endpoint returned something that didn't match with what was in the index? Or would we just start seeing weird failures? Could it be worth putting in a check and a log line just in case, or is that overkill? |
Hmm. After this PR is merged, and everything that exists in the index should exist in cluster state. When the polling logic gets a response, it will check cluster state to see if any of the authorized endpoints don't exist in cluster state (aka they are new). If cluster state contains an endpoint id already then the polling logic doesn't attempt to store it. Unfortunately cluster state doesn't include everything that the backing index does (for example no chunking settings). It only contains the I could add a log that compares those. Another option is that we don't rely on cluster state, instead we could retrieve all the inference endpoints from the index and then do the comparison 🤔. I guess that'd spend a few more cycles because it'd need to read from the index but, it's only every 15 minutes or so. |
| if (defaultConfigIds.containsKey(inferenceEntityId)) { | ||
| return true; | ||
| } | ||
|
|
There was a problem hiding this comment.
Can we add a comment here, that this the place, where we check the cluster state?
| () -> delegate.onFailure( | ||
| new ElasticsearchStatusException( | ||
| format( | ||
| "Failed to add the inference endpoints %s. The service may be in an " |
There was a problem hiding this comment.
| "Failed to add the inference endpoints %s. The service may be in an " | |
| "Failed to add the inference endpoints [%s]. The service may be in an " |
nit: I think the ES codebase usually uses [...] for showing values
There was a problem hiding this comment.
Actually I think because this is a set it will add the brackets already.
| var cleanupListener = addModelMetadataTaskListener.delegateResponse((delegate, exc) -> { | ||
| logger.atWarn() | ||
| .withThrowable(exc) | ||
| .log("Failed to add minimal service settings to cluster state for inference endpoints {}", inferenceIdsSet); |
There was a problem hiding this comment.
| .log("Failed to add minimal service settings to cluster state for inference endpoints {}", inferenceIdsSet); | |
| .log("Failed to add minimal service settings to cluster state for inference endpoints [{}]", inferenceIdsSet); |
Not sure, if log messages also usually use the [element1, ..., elementN] notation. Feel free to ignore
There was a problem hiding this comment.
Yeah I think this will print the brackets because it's a set and that's the default toString implementation.
|
|
||
| metadataTaskQueue.submitTask( | ||
| format("add model metadata for %s", inferenceIdsSet), | ||
| new AddModelMetadataTask( |
There was a problem hiding this comment.
This adds the model metadata only to the cluster state, right? If so, could we add a comment that this only updates the cluster state and not the index?
Could also be beneficial for readability to extract it to a private method "addModelMetadataToClusterState(...)".
| .addListener(listener); | ||
| } | ||
|
|
||
| private void handleOutOfSyncEndpoints(ResponseInfo responseInfo, ActionListener<Void> listener, TimeValue timeout) { |
There was a problem hiding this comment.
Could you add 1-2 sentences, what we consider an out-of-sync endpoint? Just that others not too familiar with this part of the codebase get a sense of what this means.
| } | ||
|
|
||
| var fixOutOfSyncListener = ActionListener.<GetInferenceModelAction.Response>wrap((response) -> { | ||
| var endpointsToFix = new ArrayList<ModelAndSettings>(); |
There was a problem hiding this comment.
| var endpointsToFix = new ArrayList<ModelAndSettings>(); | |
| var outOfSyncEndpoints = new ArrayList<ModelAndSettings>(); |
very small consistency nit: then it's easier to grasp immediately that these are simply the out of sync endpoints
… version changed This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that: - are new - have a different fingerprint - have a newer version Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint metadata. Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not be parsed before. In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the out-of-sync problem if it ever occurs.
… version changed (#143567) This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that: - are new - have a different fingerprint - have a newer version Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint metadata. Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not be parsed before. In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in #138934 as it is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the out-of-sync problem if it ever occurs.
… version changed (elastic#143567) This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that: - are new - have a different fingerprint - have a newer version Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint metadata. Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not be parsed before. In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the out-of-sync problem if it ever occurs.
… version changed (elastic#143567) This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that: - are new - have a different fingerprint - have a newer version Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint metadata. Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not be parsed before. In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the out-of-sync problem if it ever occurs.
This PR fixes a bug where the EIS preconfigured endpoints would appear to no longer be contained in the
ModelRegistry. This occurs because prior to making the authorization polling logic occur within a persistent task, the endpoints were stored in an in memory map within theModelRegistry. After the persistent task logic was added they are moved to cluster state. Adding them to cluster state only happens if the endpoint does not exist.This PR will perform a check to see if an endpoint exists in the backing index and not in the cluster state. If so, we add it to the cluster state.
This should also fix the error messages we're seeing: