[ML] Regression in ML Trained Model Deployment Update Causes Failure

### Elasticsearch Version

8.11.3

### Installed Plugins

_No response_

### Java Version

21

### OS Version

N/A

### Problem Description

**Background**:
In a high-availability environment with enough ML nodes, updating `number_of_allocations` for a trained model deployment through the `POST _ml/trained_models/{model_id}/deployment/_update` API fails. The regression surfaced after PR #98139 and blocks valid updates to the number of allocations even though ample resources are available. It has significantly disrupted normal use of the feature in our production environment.
![image](https://github.com/elastic/elasticsearch/assets/26055883/7f6ab200-9625-4387-b61d-c300ef579a1d)


**Description of the problem including expected versus actual behavior**:
The reproduction below is a simplified version of the production case. In brief:
The update fails with a `status_exception` (429) when `number_of_allocations` is set to 30, and with an `illegal_argument_exception` (400) reporting a negative byte value when it is set to 31. In previous versions, such updates succeeded.

### Steps to Reproduce

**Model information for reproduction**:

- Model ID: "baai__bge-base-zh-v1.5" (https://huggingface.co/BAAI/bge-base-zh-v1.5)
- Model size stats: { "model_size": "387.9mb", "model_size_bytes": 406780568 }

**Steps to reproduce**:

1. Deploy a trained model with some initial `number_of_allocations` in an environment with at least two ML nodes (32c64g).
2. Update the deployment with `number_of_allocations` set to 30 and observe the `status_exception` (429).
3. Update the deployment with `number_of_allocations` set to 31 and observe the `illegal_argument_exception` (400); the debug logs show a negative byte value.
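
The two update calls in steps 2 and 3 can be sketched with the JDK's built-in HTTP client types. This is only an illustration of the request shape: the base URL `http://localhost:9200` is an assumption, authentication is omitted, and the class and method names are hypothetical. The sketch builds the requests without sending them, since sending requires a running cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class UpdateDeploymentRequestSketch {

    // Builds the POST request for the deployment update API named in this report.
    static HttpRequest buildUpdateRequest(String baseUrl, String modelId, int numberOfAllocations) {
        String body = String.format("{\"number_of_allocations\": %d}", numberOfAllocations);
        URI uri = URI.create(baseUrl + "/_ml/trained_models/" + modelId + "/deployment/_update");
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:9200";  // assumed cluster address
        String modelId = "baai__bge-base-zh-v1.5"; // model ID from this report

        // The two values that trigger the 429 and 400 responses described above.
        for (int allocations : new int[] {30, 31}) {
            HttpRequest request = buildUpdateRequest(baseUrl, modelId, allocations);
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

To actually issue the calls, pass each request to `java.net.http.HttpClient#send` against a cluster that has the model deployed.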

### Logs (if relevant)

Responses when `number_of_allocations` is set to 30 and 31 in version 8.11.3:
<img width="1691" alt="image" src="https://github.com/elastic/elasticsearch/assets/26055883/a8708ff8-bffc-4f8f-a400-72ac621d02df">
Debug log:
![image](https://github.com/elastic/elasticsearch/assets/26055883/ac3c18c4-314c-424d-8bc7-7ce6a64b6e58)


For reference, the behavior in versions prior to 8.11, where the issue was not present:
<img width="1687" alt="image" src="https://github.com/elastic/elasticsearch/assets/26055883/7a684ff4-e77e-4c83-911b-935e895be378">

**Analysis of the problem**:
Upon reviewing PR #98139, I discovered that the core issue lies in the changes made to the `org.elasticsearch.xpack.core.ml.action.StartTrainedModelDeploymentAction#estimateMemoryUsageBytes` method. The modification introduced a linear relationship between the estimated memory usage and the `numberOfAllocations`, which, when combined with the subsequent arithmetic operations in `org.elasticsearch.xpack.ml.inference.assignment.planning.AssignmentPlan.Builder#accountMemory`, can result in a negative value. This negative value is then incorrectly passed to `ByteSizeValue.ofBytes().toString()`, leading to an `IllegalArgumentException`.
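
The arithmetic behind this can be illustrated with a minimal, standalone sketch. It does not reproduce the actual Elasticsearch code: the linear formula, the per-allocation cost, the free-memory figure, and all class and method names below are assumptions chosen for illustration. Only `model_size_bytes` comes from this report. The sketch shows how an estimate that grows with `numberOfAllocations` can exceed a node's free memory, how an unchecked subtraction then goes negative, and how a formatter that rejects negative values would throw.

```java
public class MemoryAccountingSketch {

    // Hypothetical linear estimate: the cost grows with every additional allocation.
    // The real formula in estimateMemoryUsageBytes is not reproduced here.
    static long estimateMemoryUsageBytes(long modelSizeBytes, long perAllocationBytes, int numberOfAllocations) {
        return modelSizeBytes + perAllocationBytes * numberOfAllocations;
    }

    // Unchecked accounting: subtracts without verifying that the deployment fits.
    static long remainingBytes(long freeBytes, long requiredBytes) {
        return freeBytes - requiredBytes;
    }

    // Stand-in for a byte-size formatter that rejects negative input.
    static String formatBytes(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("negative byte value: " + bytes);
        }
        return bytes + "b";
    }

    // Defensive check: confirm the fit first so a negative value is never formatted.
    static boolean fits(long freeBytes, long requiredBytes) {
        return requiredBytes <= freeBytes;
    }

    public static void main(String[] args) {
        long modelSizeBytes = 406_780_568L;            // model_size_bytes from this report
        long perAllocationBytes = 512L * 1024 * 1024;  // assumed value, illustration only
        long freeBytes = 16L * 1024 * 1024 * 1024;     // assumed free memory on one node

        for (int allocations : new int[] {4, 40}) {
            long required = estimateMemoryUsageBytes(modelSizeBytes, perAllocationBytes, allocations);
            long remaining = remainingBytes(freeBytes, required);
            if (fits(freeBytes, required)) {
                System.out.println(allocations + " allocations leave " + formatBytes(remaining));
            } else {
                // Formatting this value with formatBytes would throw.
                System.out.println(allocations + " allocations do not fit; raw remainder is " + remaining);
            }
        }
    }
}
```

Checking the fit before formatting or logging the remainder is one way to keep a negative value out of the byte-size formatter, which is roughly the scope of the first PR proposed below.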

**Proposed Solution**:

I plan to submit two PRs: the first to handle the immediate issue of negative byte values, and the second to revert the key changes introduced by PR #98139 as a temporary measure. Concurrently, I am eager to participate in discussions for a more permanent fix that aligns with the project's goals and architecture.

It is crucial to address this regression constructively, acknowledging the complexity of software development and the potential for unintended side effects in contributions. The focus is on rectifying the issue to ensure the stability and reliability of the Elasticsearch ML features.

**Additional context**:

The issue was identified through detailed analysis and is documented with screenshots and code references to facilitate understanding and reproduction of the problem. I am committed to working on a swift resolution for this issue, as it is affecting our production environment. I am also open to engaging in discussions for a long-term solution and willing to contribute to its development.

The regression affects all subsequent versions and requires prompt attention to mitigate its impact on production environments.
