[ML] Adding missing endpoint information to cluster state by jonathan-buttner · Pull Request #138934 · elastic/elasticsearch

jonathan-buttner · 2025-12-02T21:05:49Z

This PR fixes a bug where the EIS preconfigured endpoints appear to be missing from the ModelRegistry. Before the authorization polling logic was moved into a persistent task, the endpoints were stored in an in-memory map within the ModelRegistry. After the persistent task logic was added, they were moved to cluster state. However, an endpoint is only added to cluster state if it does not already exist.

This PR will perform a check to see if an endpoint exists in the backing index and not in the cluster state. If so, we add it to the cluster state.
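
A minimal sketch of that check, assuming the endpoint ids from both sources are already available as sets. The class and method names here are hypothetical and are not taken from the PR; the real code works through the ModelRegistry and async listeners.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not the actual ModelRegistry code.
final class MissingEndpointCheck {

    /**
     * Returns the endpoint ids that exist in the backing index
     * but are absent from cluster state.
     */
    static Set<String> missingFromClusterState(Set<String> idsInIndex, Set<String> idsInClusterState) {
        // Copy first so the caller's set is never modified.
        Set<String> missing = new LinkedHashSet<>(idsInIndex);
        missing.removeAll(idsInClusterState);
        return missing;
    }
}
```

Every id returned by `missingFromClusterState` would then be added to cluster state using the document already stored in the index.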

This should also fix the error messages we're seeing:

Failed to store document id: [model_.elser-2-elastic] inference id: [.elser-2-elastic] index: [.secrets-inference] bulk failure message [[.secrets-inference/q][[.secrets-inference][0]] org.elasticsearch.index.engine.VersionConflictEngineException: [model_.elser-2-elastic]: version conflict, document already exists (current version [1])]

…is-endpoints

DonalEvans

No major changes, just a couple of small test things.

One thing I'm not clear on is that if the index already contains the endpoint (which is what causes the VersionConflictEngineException), does what's in the index need to be updated in any way to match what now gets stored in the cluster state? Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

.../internalClusterTest/java/org/elasticsearch/xpack/inference/integration/ModelRegistryIT.java

jonathan-buttner · 2025-12-02T22:58:56Z

> One thing I'm not clear on is that if the index already contains the endpoint (which is what causes the VersionConflictEngineException), does what's in the index need to be updated in any way to match what now gets stored in the cluster state? Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

The logic should be retrieving what's in the index and storing it in the cluster state. If it's not doing that let me know haha.
So we shouldn't need to update anything in the index. The cluster state and index should match now.

> Or is it guaranteed that what's in the indices will match what ends up in the cluster state?

Yeah, so I'm intentionally not using the contents of what the authorization endpoint is sending us. I'm 99% sure they'll be the same, but it's best to use whatever was originally stored in the index so that it's consistent.

In the future if we need to migrate the indices we'll need to write logic to handle that specifically. It'll be more complicated because we'll have to update the indices.


elasticsearchmachine · 2025-12-03T00:45:18Z

Pinging @elastic/ml-core (Team:ML)

DonalEvans · 2025-12-03T00:55:03Z

> The logic should be retrieving what's in the index and storing it in the cluster state.

You're right, I think I lost track of where stuff was coming from what with all the listeners. One day I'll be able to understand code that uses them properly the first time I look at it. I hope.

Out of curiosity, would we have any way of knowing if the authorization endpoint returned something that didn't match with what was in the index? Or would we just start seeing weird failures? Could it be worth putting in a check and a log line just in case, or is that overkill?

jonathan-buttner · 2025-12-03T01:05:18Z

> Out of curiosity, would we have any way of knowing if the authorization endpoint returned something that didn't match with what was in the index? Or would we just start seeing weird failures? Could it be worth putting in a check and a log line just in case, or is that overkill?

Hmm. After this PR is merged, everything that exists in the index should exist in cluster state. When the polling logic gets a response, it will check cluster state to see if any of the authorized endpoints don't exist in cluster state (aka they are new). If cluster state already contains an endpoint id, then the polling logic doesn't attempt to store it.
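
Roughly, the filtering step looks like the sketch below. This is a simplified illustration with hypothetical names, assuming the authorized ids arrive as a list and the known ids as a set; it is not the actual polling code.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of the polling filter; names are not from the PR.
final class NewEndpointFilter {

    /**
     * Keeps only the authorized endpoint ids that cluster state does not know about yet.
     * Ids already in cluster state are skipped, so existing entries are never overwritten.
     */
    static List<String> endpointsToStore(List<String> authorizedIds, Set<String> idsInClusterState) {
        return authorizedIds.stream()
            .filter(id -> !idsInClusterState.contains(id))
            .collect(Collectors.toList());
    }
}
```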

Unfortunately cluster state doesn't include everything that the backing index does (for example no chunking settings). It only contains the MinimalServiceSettings (service name, task type, dimensions, similarity, and element type).

I could add a log that compares those.
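
Something along these lines, as a rough sketch. The `MinimalSettings` record is a hypothetical stand-in that only mirrors the fields listed above; it is not the real MinimalServiceSettings class, and the logger is plain `java.util.logging` rather than whatever the plugin uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; the types and names here are assumptions for illustration.
final class SettingsMismatchCheck {
    private static final Logger logger = Logger.getLogger(SettingsMismatchCheck.class.getName());

    /** Stand-in for the minimal settings held in cluster state. */
    record MinimalSettings(String service, String taskType, Integer dimensions, String similarity, String elementType) {}

    /**
     * Logs a warning for each authorized endpoint whose minimal settings differ
     * from the ones in cluster state, and returns the ids that differ.
     * Endpoints missing from cluster state are treated as new, not as mismatches.
     */
    static List<String> findMismatches(Map<String, MinimalSettings> inClusterState, Map<String, MinimalSettings> authorized) {
        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, MinimalSettings> entry : authorized.entrySet()) {
            MinimalSettings stored = inClusterState.get(entry.getKey());
            if (stored != null && !stored.equals(entry.getValue())) {
                logger.warning(String.format(
                    "Endpoint [%s] differs: cluster state has %s, authorization response has %s",
                    entry.getKey(), stored, entry.getValue()));
                mismatched.add(entry.getKey());
            }
        }
        return mismatched;
    }
}
```

This would only catch differences in those five fields; anything that lives solely in the index, such as chunking settings, would go unnoticed.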

Another option is that we don't rely on cluster state. Instead, we could retrieve all the inference endpoints from the index and then do the comparison 🤔.

I guess that'd spend a few more cycles because it'd need to read from the index, but it's only every 15 minutes or so.

timgrein

LGTM, thanks for the fix
I only added some minor comments and requests for explanation in the code :) Since we want to get this in before FF, these can also be addressed after this initial PR merges.

timgrein · 2025-12-03T11:22:45Z