
Exclude Default Inference Endpoints from Cluster State Storage#125242

Merged
jimczi merged 2 commits into elastic:main from jimczi:model_regsitry_default_endpoints
Mar 19, 2025

Conversation

@jimczi
Contributor

@jimczi jimczi commented Mar 19, 2025

When retrieving a default inference endpoint for the first time, the system automatically creates the endpoint. However, unlike the put inference model action, the get action does not redirect the request to the master node.

Since #121106, we rely on the assumption that every model creation (put model) must run on the master node, as it modifies the cluster state. However, this assumption led to a bug where the get action tries to store default inference endpoints from a different node.

This change resolves the issue by preventing default inference endpoints from being added to the cluster state. These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

Note: This bug did not prevent the default endpoints from being used, but it caused repeated attempts to store them in the index, resulting in logged errors on every use. This is an unreleased bug, so marking it as a non-issue.
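The guard this change introduces can be sketched minimally. The class and method names below are hypothetical (not the actual `ModelRegistry` code), and the cluster-state metadata is simulated with a plain map; the point is only that default endpoint ids are filtered out before any cluster-state write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ModelRegistrySketch {
    private final Set<String> defaultEndpointIds;
    // Stand-in for the per-model MinimalServiceSettings kept in the cluster state.
    private final Map<String, String> clusterStateMetadata = new HashMap<>();

    ModelRegistrySketch(Set<String> defaultEndpointIds) {
        this.defaultEndpointIds = defaultEndpointIds;
    }

    // Returns true only when the model settings were added to the
    // (simulated) cluster-state metadata.
    boolean storeModel(String inferenceEntityId, String minimalSettings) {
        if (defaultEndpointIds.contains(inferenceEntityId)) {
            // Default endpoints are re-reported by the inference services on
            // startup, so persisting them in the cluster state is unnecessary,
            // and attempting it from a non-master node (the get path) fails.
            return false;
        }
        clusterStateMetadata.put(inferenceEntityId, minimalSettings);
        return true;
    }

    boolean isStored(String inferenceEntityId) {
        return clusterStateMetadata.containsKey(inferenceEntityId);
    }
}
```

With this shape, the `get`-triggered auto-creation of a default endpoint becomes a no-op for the cluster state, so it no longer matters which node handles the request.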

@jimczi jimczi added >non-issue :ml Machine learning v9.1.0 labels Mar 19, 2025
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Mar 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jonathan-buttner
Contributor

These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

Is the reason because other plugins can access the default endpoints directly in memory through the inference plugin?

Also, we should backport this change, right?

@jimczi
Contributor Author

jimczi commented Mar 19, 2025

Is the reason because other plugins can access the default endpoints directly in memory through the inference plugin?

Yes, through the local model registry since all services register their default models there.
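That registration flow can be sketched roughly as follows. The interface and class names are hypothetical (the real service and registry interfaces are more involved); the sketch only shows that each service hands its default endpoint ids to an in-memory registry at startup, so other plugins can read them without touching the cluster state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical slice of an inference service: the only capability modeled
// here is reporting its default endpoint ids.
interface InferenceServiceSketch {
    List<String> defaultConfigIds();
}

class LocalModelRegistrySketch {
    private final List<String> defaults = new ArrayList<>();

    void registerService(InferenceServiceSketch service) {
        // Called once per service at startup; no cluster-state read involved.
        defaults.addAll(service.defaultConfigIds());
    }

    List<String> defaultEndpointIds() {
        return List.copyOf(defaults);
    }
}
```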

Also, we should backport this change, right?

The main change is not backported yet, so I'll manually add it in this PR.


if (out.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_MODEL_REGISTRY_METADATA)) {
out.writeBoolean(returnMinimalConfig);
}
Contributor Author


This is a leftover from #121106. It's OK to remove since the transport serialisation is unused: the action is a HandledTransportAction that executes directly on the receiving node.
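A toy illustration of why that serialisation is dead code on a locally executed action (hypothetical names, not the actual transport classes): on the local path the handler receives the same request object, so `writeTo` is never invoked and any version-gated fields inside it are never exercised.

```java
class LocalDispatchSketch {
    // Counts wire serializations; stays at zero on the local path.
    static int wireSerializations = 0;

    static class Request {
        final boolean returnMinimalConfig;

        Request(boolean returnMinimalConfig) {
            this.returnMinimalConfig = returnMinimalConfig;
        }

        // The removed version-gated write would live in a method like this.
        void writeTo() {
            wireSerializations++;
        }
    }

    // HandledTransportAction-style local dispatch: no wire round-trip, the
    // handler sees the original request object.
    static boolean handleLocally(Request request) {
        return request.returnMinimalConfig;
    }
}
```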

@jimczi jimczi merged commit 2f1c857 into elastic:main Mar 19, 2025
17 checks passed
@jimczi jimczi deleted the model_regsitry_default_endpoints branch March 19, 2025 20:19
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 19, 2025
…ic#125242)

elasticsearchmachine pushed a commit that referenced this pull request Mar 20, 2025
* Add ModelRegistryMetadata to Cluster State (#121106)

This commit integrates `MinimalServiceSettings` (introduced in #120560) into the cluster state for all registered models in the `ModelRegistry`.
These settings allow consumers to access configuration details without requiring asynchronous calls to retrieve full model configurations.

To ensure consistency, the cluster state metadata must remain synchronized with the models in the inference index.
If a mismatch is detected during startup, the master node performs an upgrade to load all model settings from the index.

* fix test compil

* fix serialisation

* Exclude Default Inference Endpoints from Cluster State Storage (#125242)

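The index/cluster-state consistency check described in the squashed commit above could be sketched roughly as follows (hypothetical names): the master node compares the model ids in the inference index with those in the cluster-state metadata, and a non-empty difference triggers the upgrade that reloads all settings from the index.

```java
import java.util.HashSet;
import java.util.Set;

class MetadataSyncSketch {
    // Returns the model ids present in the inference index but missing from
    // the cluster-state metadata; a non-empty result means the metadata is
    // stale and must be rebuilt from the index.
    static Set<String> missingFromMetadata(Set<String> indexModelIds, Set<String> metadataModelIds) {
        Set<String> missing = new HashSet<>(indexModelIds);
        missing.removeAll(metadataModelIds);
        return missing;
    }
}
```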
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 21, 2025
…ting

The Elastic inference service removes the default models at startup if the node cannot access EIS. Since elastic#125242, we don't store default models in the cluster state, but we still try to delete them. This change ensures that we don't try to update the cluster state when a default model is deleted: the delete is not performed on the master node, and default models are never stored in the cluster state anyway.
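A minimal sketch of that delete-path guard (hypothetical names): deletion of the index document proceeds as usual, but a cluster-state update is only issued for non-default models.

```java
import java.util.Set;

class DeleteModelSketch {
    // Returns true when deleting this model should also issue a
    // cluster-state update to drop its metadata entry.
    static boolean clusterStateUpdateNeeded(String modelId, Set<String> defaultModelIds) {
        // Default models were never added to the cluster state, and the
        // delete does not run on the master node, so skip the update.
        return !defaultModelIds.contains(modelId);
    }
}
```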
jimczi added a commit that referenced this pull request Mar 21, 2025
…ting (#125369)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…ic#125242)

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…ting (elastic#125369)

elasticsearchmachine pushed a commit that referenced this pull request Mar 25, 2025
…ting (#125369) (#125597)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…ic#125242)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…ting (elastic#125369)


Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v9.1.0


3 participants