
Exclude Default Inference Endpoints from Cluster State Storage#125242

Merged
jimczi merged 2 commits into elastic:main from jimczi:model_regsitry_default_endpoints
Mar 19, 2025

Conversation

@jimczi
Contributor

@jimczi jimczi commented Mar 19, 2025

When retrieving a default inference endpoint for the first time, the system automatically creates the endpoint. However, unlike the put inference model action, the get action does not redirect the request to the master node.

Since #121106, we rely on the assumption that every model creation (put model) must run on the master node, as it modifies the cluster state. However, this assumption led to a bug where the get action tries to store default inference endpoints from a different node.

This change resolves the issue by preventing default inference endpoints from being added to the cluster state. These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

Note: This bug did not prevent the default endpoints from being used, but it caused repeated attempts to store them in the index, resulting in logged errors on every use. This is an unreleased bug, so marking it as a non-issue.
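The guard this change introduces can be sketched minimally. The class and method names below are hypothetical (not the actual `ModelRegistry` code), and the cluster-state metadata is simulated with a plain map; the point is only that default endpoint ids are filtered out before any cluster-state write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ModelRegistrySketch {
    private final Set<String> defaultEndpointIds;
    // Stand-in for the per-model MinimalServiceSettings kept in the cluster state.
    private final Map<String, String> clusterStateMetadata = new HashMap<>();

    ModelRegistrySketch(Set<String> defaultEndpointIds) {
        this.defaultEndpointIds = defaultEndpointIds;
    }

    // Returns true only when the model settings were added to the
    // (simulated) cluster-state metadata.
    boolean storeModel(String inferenceEntityId, String minimalSettings) {
        if (defaultEndpointIds.contains(inferenceEntityId)) {
            // Default endpoints are re-reported by the inference services on
            // startup, so persisting them in the cluster state is unnecessary,
            // and attempting it from a non-master node (the get path) fails.
            return false;
        }
        clusterStateMetadata.put(inferenceEntityId, minimalSettings);
        return true;
    }

    boolean isStored(String inferenceEntityId) {
        return clusterStateMetadata.containsKey(inferenceEntityId);
    }
}
```

With this shape, the `get`-triggered auto-creation of a default endpoint becomes a no-op for the cluster state, so it no longer matters which node handles the request.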

@jimczi jimczi added >non-issue :ml Machine learning v9.1.0 labels Mar 19, 2025
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Mar 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jonathan-buttner
Contributor

These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

Is the reason because other plugins can access the default endpoints directly in memory through the inference plugin?

Also, we should backport this change, right?

@jimczi
Contributor Author

jimczi commented Mar 19, 2025

Is the reason because other plugins can access the default endpoints directly in memory through the inference plugin?

Yes, through the local model registry since all services register their default models there.
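That registration flow can be sketched roughly as follows. The interface and class names are hypothetical (the real service and registry interfaces are more involved); the sketch only shows that each service hands its default endpoint ids to an in-memory registry at startup, so other plugins can read them without touching the cluster state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical slice of an inference service: the only capability modeled
// here is reporting its default endpoint ids.
interface InferenceServiceSketch {
    List<String> defaultConfigIds();
}

class LocalModelRegistrySketch {
    private final List<String> defaults = new ArrayList<>();

    void registerService(InferenceServiceSketch service) {
        // Called once per service at startup; no cluster-state read involved.
        defaults.addAll(service.defaultConfigIds());
    }

    List<String> defaultEndpointIds() {
        return List.copyOf(defaults);
    }
}
```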

Also, we should backport this change, right?

The main change is not backported yet, so I'll manually add it in this PR.


if (out.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_MODEL_REGISTRY_METADATA)) {
out.writeBoolean(returnMinimalConfig);
}
Contributor Author


This is a leftover from #121106. It's OK to remove since the transport serialisation is unused: the action is a HandledTransportAction that executes directly on the receiving node.
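A toy illustration of why that serialisation is dead code on a locally executed action (hypothetical names, not the actual transport classes): on the local path the handler receives the same request object, so `writeTo` is never invoked and any version-gated fields inside it are never exercised.

```java
class LocalDispatchSketch {
    // Counts wire serializations; stays at zero on the local path.
    static int wireSerializations = 0;

    static class Request {
        final boolean returnMinimalConfig;

        Request(boolean returnMinimalConfig) {
            this.returnMinimalConfig = returnMinimalConfig;
        }

        // The removed version-gated write would live in a method like this.
        void writeTo() {
            wireSerializations++;
        }
    }

    // HandledTransportAction-style local dispatch: no wire round-trip, the
    // handler sees the original request object.
    static boolean handleLocally(Request request) {
        return request.returnMinimalConfig;
    }
}
```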

@jimczi jimczi merged commit 2f1c857 into elastic:main Mar 19, 2025
17 checks passed
@jimczi jimczi deleted the model_regsitry_default_endpoints branch March 19, 2025 20:19
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 19, 2025
…ic#125242)

elasticsearchmachine pushed a commit that referenced this pull request Mar 20, 2025
* Add ModelRegistryMetadata to Cluster State (#121106)

This commit integrates `MinimalServiceSettings` (introduced in #120560) into the cluster state for all registered models in the `ModelRegistry`.
These settings allow consumers to access configuration details without requiring asynchronous calls to retrieve full model configurations.

To ensure consistency, the cluster state metadata must remain synchronized with the models in the inference index.
If a mismatch is detected during startup, the master node performs an upgrade to load all model settings from the index.

* fix test compil

* fix serialisation

* Exclude Default Inference Endpoints from Cluster State Storage (#125242)

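The index/cluster-state consistency check described in the squashed commit above could be sketched roughly as follows (hypothetical names): the master node compares the model ids in the inference index with those in the cluster-state metadata, and a non-empty difference triggers the upgrade that reloads all settings from the index.

```java
import java.util.HashSet;
import java.util.Set;

class MetadataSyncSketch {
    // Returns the model ids present in the inference index but missing from
    // the cluster-state metadata; a non-empty result means the metadata is
    // stale and must be rebuilt from the index.
    static Set<String> missingFromMetadata(Set<String> indexModelIds, Set<String> metadataModelIds) {
        Set<String> missing = new HashSet<>(indexModelIds);
        missing.removeAll(metadataModelIds);
        return missing;
    }
}
```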
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 21, 2025
…ting

The Elastic inference service removes the default models at startup if the node cannot access EIS. Since elastic#125242, we don't store default models in the cluster state, but we still try to delete them. This change ensures that we don't try to update the cluster state when a default model is deleted: the delete is not performed on the master node, and default models are never stored in the cluster state anyway.
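A minimal sketch of that delete-path guard (hypothetical names): deletion of the index document proceeds as usual, but a cluster-state update is only issued for non-default models.

```java
import java.util.Set;

class DeleteModelSketch {
    // Returns true when deleting this model should also issue a
    // cluster-state update to drop its metadata entry.
    static boolean clusterStateUpdateNeeded(String modelId, Set<String> defaultModelIds) {
        // Default models were never added to the cluster state, and the
        // delete does not run on the master node, so skip the update.
        return !defaultModelIds.contains(modelId);
    }
}
```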
jimczi added a commit that referenced this pull request Mar 21, 2025
…ting (#125369)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…ic#125242)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…ting (elastic#125369)

elasticsearchmachine pushed a commit that referenced this pull request Mar 25, 2025
…ting (#125369) (#125597)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…ic#125242)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…ting (elastic#125369)


Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v9.1.0


3 participants