Elasticsearch Version
8.11.3
Installed Plugins
No response
Java Version
21
OS Version
N/A
Problem Description
Background:
In high-availability environments with a sufficient number of ML nodes, I am experiencing failures when attempting to update the number_of_allocations for a trained model deployment using the POST _ml/trained_models/{model_id}/deployment/_update API. This regression, which surfaced after the introduction of PR #98139, prevents valid updates to the number of allocations, despite the availability of ample resources. This bug has significantly impacted the normal usage of the feature in our production environment.

Description of the problem including expected versus actual behavior:
I reproduced the problem in a simplified setup. In brief:
The update operation fails with a status_exception(429) when number_of_allocations is set to 30, and with an illegal_argument_exception(400) for a negative byte value when set to 31. This behavior is a departure from previous versions where such updates were successful.
Steps to Reproduce
Model information for reproduction:
Steps to reproduce:
- Deploy a trained model with a certain number_of_allocations on an environment with at least two ML nodes (32c64g).
- Attempt to update the deployment with number_of_allocations set to 30 and observe the status_exception(429).
- Attempt to update the deployment with number_of_allocations set to 31 and observe the illegal_argument_exception(400) with debug logs indicating a negative byte value.
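For reference, the update call in the steps above can be sketched as follows. This is only an illustration of the request shape: the host `http://localhost:9200` and the model id `my-model` are placeholders I made up, and the request is built but not sent.

```python
import json
from urllib.request import Request


def build_update_request(base_url: str, model_id: str, number_of_allocations: int) -> Request:
    """Build (but do not send) the deployment update request."""
    url = f"{base_url.rstrip('/')}/_ml/trained_models/{model_id}/deployment/_update"
    body = json.dumps({"number_of_allocations": number_of_allocations}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and model id; replace with real values before sending.
req = build_update_request("http://localhost:9200", "my-model", 30)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it with `urllib.request.urlopen(req)` against a real cluster would also need whatever authentication the cluster requires, which is left out here.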
Logs (if relevant)
- Responses when number_of_allocations is set to 30 and 31 in version 8.11.3 (screenshot)
- Debug log (screenshot)
- Behavior in versions prior to 8.11, where the issue was not present (screenshot)

Analysis of the problem:
Upon reviewing PR #98139, I discovered that the core issue lies in the changes made to the org.elasticsearch.xpack.core.ml.action.StartTrainedModelDeploymentAction#estimateMemoryUsageBytes method. The modification introduced a linear relationship between the estimated memory usage and the numberOfAllocations, which, when combined with the subsequent arithmetic operations in org.elasticsearch.xpack.ml.inference.assignment.planning.AssignmentPlan.Builder#accountMemory, can result in a negative value. This negative value is then incorrectly passed to ByteSizeValue.ofBytes().toString(), leading to an IllegalArgumentException.
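To make the failure mode concrete, here is a rough sketch of the arithmetic, not the actual Elasticsearch code. The linear formula, the function names, and all numbers are assumptions chosen for illustration, and `format_bytes` is a simplified stand-in for a strict byte-size formatter that refuses negative input.

```python
GIB = 1024 ** 3


def estimate_memory_bytes(base_bytes: int, per_allocation_bytes: int, number_of_allocations: int) -> int:
    """Hypothetical linear estimate: a fixed part plus a part per allocation."""
    return base_bytes + per_allocation_bytes * number_of_allocations


def remaining_after_assignment(free_bytes: int, estimated_bytes: int) -> int:
    """Plain subtraction with no lower bound, so the result can drop below zero."""
    return free_bytes - estimated_bytes


def format_bytes(value: int) -> str:
    """Stand-in for a strict byte-size formatter that rejects negative input."""
    if value < 0:
        raise ValueError(f"negative byte value: {value}")
    return f"{value}b"


free = 10 * GIB  # made-up free memory on a node
for allocations in (8, 10):  # made-up allocation counts
    estimate = estimate_memory_bytes(GIB, GIB, allocations)
    remaining = remaining_after_assignment(free, estimate)
    try:
        print(allocations, format_bytes(remaining))
    except ValueError as err:
        print(allocations, "fails:", err)
```

Because the estimate grows with every allocation, a large enough allocation count pushes the remaining value below zero, and the failure only shows up later when that value is formatted.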
Proposed Solution:
I plan to submit two PRs: the first to handle the immediate issue of negative byte values, and the second to revert the key changes introduced by PR #98139 as a temporary measure. Concurrently, I am eager to participate in discussions for a more permanent fix that aligns with the project's goals and architecture.
It is crucial to address this regression constructively, acknowledging the complexity of software development and the potential for unintended side effects in contributions. The focus is on rectifying the issue to ensure the stability and reliability of the Elasticsearch ML features.
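As a sketch of what the first PR could look like in spirit, the snippet below formats byte counts for a log message without raising on negative values and performs the capacity check explicitly. It is an assumption about one possible approach, not the actual patch; `describe_bytes` and `fits` are hypothetical names.

```python
def describe_bytes(value: int) -> str:
    """Format a byte count for a log message without raising on negative input."""
    if value < 0:
        return f"{value} bytes (over-committed by {-value} bytes)"
    return f"{value}b"


def fits(free_bytes: int, required_bytes: int) -> bool:
    """Explicit capacity check, done before any subtraction can go negative."""
    return required_bytes <= free_bytes


# Made-up numbers for illustration.
free, required = 1_000, 1_200
if not fits(free, required):
    print("not enough memory, remaining would be", describe_bytes(free - required))
```

With a guard like this, an over-committed plan would surface as a clear capacity error rather than as an illegal_argument_exception raised while building a log message.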
Additional context:
I identified the issue through detailed analysis and documented the debugging process with screenshots and code references to make it easier to understand and reproduce. The regression affects all versions after the change in PR #98139 and is impacting our production environment, so I am committed to working on a swift resolution. I am also open to discussing a long-term solution and willing to contribute to its development.