Skip to content

[ML] Adding missing endpoint information to cluster state#138934

Merged
jonathan-buttner merged 8 commits intoelastic:mainfrom
jonathan-buttner:ia-fix-eis-endpoints
Dec 3, 2025
Merged

[ML] Adding missing endpoint information to cluster state#138934
jonathan-buttner merged 8 commits intoelastic:mainfrom
jonathan-buttner:ia-fix-eis-endpoints

Conversation

@jonathan-buttner
Copy link
Copy Markdown
Contributor

@jonathan-buttner jonathan-buttner commented Dec 2, 2025

This PR fixes a bug where the EIS preconfigured endpoints would appear to no longer be contained in the ModelRegistry. This occurs because prior to making the authorization polling logic occur within a persistent task, the endpoints were stored in an in memory map within the ModelRegistry. After the persistent task logic was added they are moved to cluster state. Adding them to cluster state only happens if the endpoint does not exist.

This PR will perform a check to see if an endpoint exists in the backing index and not in the cluster state. If so, we add it to the cluster state.

This should also fix the error messages we're seeing:

Failed to store document id: [model_.elser-2-elastic] inference id: [.elser-2-elastic] index: [.secrets-inference] bulk failure message [[.secrets-inference/q][[.secrets-inference][0]] org.elasticsearch.index.engine.VersionConflictEngineException: [model_.elser-2-elastic]: version conflict, document already exists (current version [1])]

@jonathan-buttner jonathan-buttner added >non-issue :ml Machine learning Team:ML Meta label for the ML team v9.3.0 labels Dec 2, 2025
Copy link
Copy Markdown
Contributor

@DonalEvans DonalEvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major changes, just a couple of small test things.

One thing I'm not clear on is that if the index already contains the endpoint (which is what causes the VersionConflictEngineException), does what's in the index need to be updated in any way to match what now gets stored in the cluster state? Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

@jonathan-buttner
Copy link
Copy Markdown
Contributor Author

One thing I'm not clear on is that if the index already contains the endpoint (which is what causes the VersionConflictEngineException), does what's in the index need to be updated in any way to match what now gets stored in the cluster state? Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

The logic should be retrieving what's in the index and storing it in the cluster state. If it's not doing that let me know haha.
So we shouldn't need to update anything in the index. The cluster state and index should match now.

Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

Yeah so I'm intentionally not using the contents of what the authorization endpoint is sending us. I'm 99% sure they'll be the same but it's best to use whatever was originally stored in the index that way it's consistent.

In the future if we need to migrate the indices we'll need to write logic to handle that specifically. It'll be more complicated because we'll have to update the indices.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review December 3, 2025 00:44
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/ml-core (Team:ML)

@DonalEvans
Copy link
Copy Markdown
Contributor

The logic should be retrieving what's in the index and storing it in the cluster state.

You're right, I think I lost track of where stuff was coming from what with all the listeners. One day I'll be able to understand code that uses them properly the first time I look at it. I hope.

Out of curiosity, would we have any way of knowing if the authorization endpoint returned something that didn't match with what was in the index? Or would we just start seeing weird failures? Could it be worth putting in a check and a log line just in case, or is that overkill?

@jonathan-buttner
Copy link
Copy Markdown
Contributor Author

jonathan-buttner commented Dec 3, 2025

Out of curiosity, would we have any way of knowing if the authorization endpoint returned something that didn't match with what was in the index? Or would we just start seeing weird failures? Could it be worth putting in a check and a log line just in case, or is that overkill?

Hmm. After this PR is merged, and everything that exists in the index should exist in cluster state. When the polling logic gets a response, it will check cluster state to see if any of the authorized endpoints don't exist in cluster state (aka they are new). If cluster state contains an endpoint id already then the polling logic doesn't attempt to store it.

Unfortunately cluster state doesn't include everything that the backing index does (for example no chunking settings). It only contains the MinimalServiceSettings (service name, task type, dimensions, similarity, and element type).

I could add a log that compares those.

Another option is that we don't rely on cluster state, instead we could retrieve all the inference endpoints from the index and then do the comparison 🤔.

I guess that'd spend a few more cycles because it'd need to read from the index but, it's only every 15 minutes or so.

Copy link
Copy Markdown
Contributor

@timgrein timgrein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix :shipit:
Added only some minor comments/asks for explanations in the code :) Being mindful of getting this in before FF these can also be addressed after this initial PR merges.

if (defaultConfigIds.containsKey(inferenceEntityId)) {
return true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here, that this the place, where we check the cluster state?

() -> delegate.onFailure(
new ElasticsearchStatusException(
format(
"Failed to add the inference endpoints %s. The service may be in an "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Failed to add the inference endpoints %s. The service may be in an "
"Failed to add the inference endpoints [%s]. The service may be in an "

nit: I think the ES codebase usually uses [...] for showing values

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think because this is a set it will add the brackets already.

var cleanupListener = addModelMetadataTaskListener.delegateResponse((delegate, exc) -> {
logger.atWarn()
.withThrowable(exc)
.log("Failed to add minimal service settings to cluster state for inference endpoints {}", inferenceIdsSet);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.log("Failed to add minimal service settings to cluster state for inference endpoints {}", inferenceIdsSet);
.log("Failed to add minimal service settings to cluster state for inference endpoints [{}]", inferenceIdsSet);

Not sure, if log messages also usually use the [element1, ..., elementN] notation. Feel free to ignore

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this will print the brackets because it's a set and that's the default toString implementation.


metadataTaskQueue.submitTask(
format("add model metadata for %s", inferenceIdsSet),
new AddModelMetadataTask(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds the model metadata only to the cluster state, right? If so, could we add a comment that this only updates the cluster state and not the index?

Could also be beneficial for readability to extract it to a private method "addModelMetadataToClusterState(...)".

.addListener(listener);
}

private void handleOutOfSyncEndpoints(ResponseInfo responseInfo, ActionListener<Void> listener, TimeValue timeout) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add 1-2 sentences, what we consider an out-of-sync endpoint? Just that others not too familiar with this part of the codebase get a sense of what this means.

}

var fixOutOfSyncListener = ActionListener.<GetInferenceModelAction.Response>wrap((response) -> {
var endpointsToFix = new ArrayList<ModelAndSettings>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var endpointsToFix = new ArrayList<ModelAndSettings>();
var outOfSyncEndpoints = new ArrayList<ModelAndSettings>();

very small consistency nit: then it's easier to grasp immediately that these are simply the out of sync endpoints

@jonathan-buttner jonathan-buttner enabled auto-merge (squash) December 3, 2025 14:58
@jonathan-buttner jonathan-buttner merged commit e93fcac into elastic:main Dec 3, 2025
34 checks passed
@jonathan-buttner jonathan-buttner deleted the ia-fix-eis-endpoints branch December 3, 2025 16:36
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Mar 4, 2026
… version changed

This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that:

  - are new
  - have a different fingerprint
  - have a newer version

Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint
metadata.

Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not
be parsed before.

In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it
is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in
cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the
out-of-sync problem if it ever occurs.
dimitris-athanasiou added a commit that referenced this pull request Mar 6, 2026
… version changed (#143567)

This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that:

  - are new
  - have a different fingerprint
  - have a newer version

Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint
metadata.

Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not
be parsed before.

In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in #138934 as it
is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in
cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the
out-of-sync problem if it ever occurs.
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Mar 6, 2026
… version changed (elastic#143567)

This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that:

  - are new
  - have a different fingerprint
  - have a newer version

Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint
metadata.

Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not
be parsed before.

In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it
is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in
cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the
out-of-sync problem if it ever occurs.
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Mar 6, 2026
… version changed (elastic#143567)

This PR modifies the `AuthorizationPoller` so that it selects to persist endpoints that:

  - are new
  - have a different fingerprint
  - have a newer version

Updating endpoints that have a different fingerprint allows dynamic updates when EIS modifies the endpoint
metadata.

Updating endpoints that have a different version allows persisting `EndpointMetadata` fields that could not
be parsed before.

In addition, in this PR I have removed the logic for handling out-of-sync endpoints added in elastic#138934 as it
is no longer needed. This was necessary for authorized endpoints only. For those, if they are not in
cluster state we'll now overwrite the doc in the index and update the cluster state. This ammends the
out-of-sync problem if it ever occurs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants