[ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count #143077

Merged
valeriy42 merged 9 commits into elastic:main from valeriy42:fix/is-107831
Feb 27, 2026

Conversation

@valeriy42
Contributor

The trained model stats API was computing required_native_memory_bytes incorrectly. Since #98139, memory estimation has depended on the number of allocations, but the code used the total allocation count across all deployments instead of each deployment’s own count. As a result, when multiple NLP models were deployed with different allocation counts, every model’s required_native_memory_bytes was based on the same summed value, so changing allocations for one model incorrectly changed the reported memory for others. This only affected the Stats API output, not actual deployment behavior.

The fix computes required_native_memory_bytes per deployment using each deployment’s allocation count. TransportGetTrainedModelsStatsAction now passes Map<String, AssignmentStats> into modelSizeStats() instead of a summed allocation count, and modelSizeStats() emits per-deployment entries keyed by deploymentId with the correct numberOfAllocations. GetTrainedModelsStatsAction.Response.Builder.build() looks up model size stats by deploymentId first and falls back to modelId for undeployed or non-PyTorch models. Unit tests were added to cover per-deployment resolution, undeployed models, and the fallback path.

Fixes #107831
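The effect of the bug can be sketched with a toy memory model (all names and constants below are hypothetical; the real estimate in the ML plugin is more involved):

```java
import java.util.Map;

public class PerDeploymentMemorySketch {
    // Illustrative constants only, not the plugin's real numbers.
    static final long BASE_BYTES = 240L;
    static final long PER_ALLOCATION_BYTES = 32L;

    // Hypothetical stand-in for the allocation-dependent estimate introduced in #98139.
    static long requiredNativeMemoryBytes(int numberOfAllocations) {
        return BASE_BYTES + PER_ALLOCATION_BYTES * numberOfAllocations;
    }

    public static void main(String[] args) {
        // Two deployments with different allocation counts.
        Map<String, Integer> allocationsByDeploymentId = Map.of("deployment-a", 2, "deployment-b", 6);

        // Buggy behavior: both deployments were reported using the summed count (2 + 6 = 8),
        // so changing one deployment's allocations changed the other's reported memory.
        int summed = allocationsByDeploymentId.values().stream().mapToInt(Integer::intValue).sum();
        long buggy = requiredNativeMemoryBytes(summed);

        // Fixed behavior: each deployment is reported with its own count.
        long fixedA = requiredNativeMemoryBytes(allocationsByDeploymentId.get("deployment-a"));
        long fixedB = requiredNativeMemoryBytes(allocationsByDeploymentId.get("deployment-b"));

        System.out.println(buggy + " " + fixedA + " " + fixedB);
    }
}
```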

@valeriy42 valeriy42 added labels on Feb 25, 2026: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), auto-backport (Automatically create backport pull requests when merged), v9.4.0, v9.3.2, v9.2.7
@elasticsearchmachine
Collaborator

Hi @valeriy42, I've created a changelog YAML for you.

Comment on lines +316 to +319
TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.getOrDefault(
deploymentId,
modelSizeStatsMap.get(modelId)
);
Contributor Author

Previously, the builder always looked up model size stats by model ID only. Now it does:

  • First: try deployment ID (for deployed PyTorch models, where stats are per deployment).
  • Fallback: use model ID (undeployed or non-PyTorch models).

So for a model with multiple deployments, you get the correct per-deployment required_native_memory_bytes instead of a single value per model.
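A minimal illustration of that lookup order, with hypothetical map contents and the value type simplified to the memory figure alone:

```java
import java.util.Map;

public class SizeStatsLookupSketch {
    public static void main(String[] args) {
        // Hypothetical contents: deployed PyTorch models are keyed by
        // deploymentId, undeployed or non-PyTorch models by modelId.
        Map<String, Long> modelSizeStatsMap = Map.of(
            "deployment-a", 304L,   // first deployment of "my-model"
            "deployment-b", 432L,   // second deployment of "my-model"
            "dfa-model", 123L       // undeployed model, keyed by modelId
        );

        // Two deployments of the same model resolve to distinct values.
        System.out.println(lookup(modelSizeStatsMap, "deployment-a", "my-model"));
        System.out.println(lookup(modelSizeStatsMap, "deployment-b", "my-model"));
        // No deployment: falls back to the modelId entry.
        System.out.println(lookup(modelSizeStatsMap, null, "dfa-model"));
    }

    // Deployment ID first (per-deployment stats), then fall back to model ID.
    static Long lookup(Map<String, Long> statsMap, String deploymentId, String modelId) {
        Long stats = deploymentId == null ? null : statsMap.get(deploymentId);
        return stats != null ? stats : statsMap.get(modelId);
    }
}
```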

Contributor


I think the nested modelSizeStatsMap.gets look a bit strange. And it's a bit inefficient: the second lookup is always done.

I'd prefer:

  TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.get(deploymentId);
  if (modelSizeStats == null) {
      modelSizeStats = modelSizeStatsMap.get(modelId);
  }

Contributor


and add your PR comment to the code please

  parentTaskId,
  l,
- numberOfAllocations
+ deploymentStatsByDeploymentId
Contributor Author


Key part: instead of passing a single summed numberOfAllocations into modelSizeStats(), it now passes a Map<String, AssignmentStats> with per-deployment stats.
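A simplified sketch of the new shape (the real classes are reduced to the fields that matter here, and the estimate formula is made up):

```java
import java.util.HashMap;
import java.util.Map;

public class ModelSizeStatsSketch {
    // Stand-ins for the real AssignmentStats and TrainedModelSizeStats classes.
    record AssignmentStats(String modelId, int numberOfAllocations) {}
    record TrainedModelSizeStats(long requiredNativeMemoryBytes) {}

    // Hypothetical allocation-dependent memory estimate.
    static long estimate(int numberOfAllocations) {
        return 240L + 32L * numberOfAllocations;
    }

    // Emits one entry per deploymentId, each computed from that deployment's
    // own allocation count, instead of one entry per modelId built from a summed count.
    static Map<String, TrainedModelSizeStats> modelSizeStats(Map<String, AssignmentStats> statsByDeploymentId) {
        Map<String, TrainedModelSizeStats> result = new HashMap<>();
        statsByDeploymentId.forEach((deploymentId, stats) ->
            result.put(deploymentId, new TrainedModelSizeStats(estimate(stats.numberOfAllocations()))));
        return result;
    }
}
```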

@valeriy42 valeriy42 marked this pull request as ready for review February 27, 2026 14:15
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Contributor

@jan-elastic jan-elastic left a comment


LGTM


@valeriy42 valeriy42 merged commit bbb8dd5 into elastic:main Feb 27, 2026
33 of 35 checks passed
@valeriy42 valeriy42 deleted the fix/is-107831 branch February 27, 2026 18:29
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Feb 27, 2026
…on Count (elastic#143077)

@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 9.3, 9.2

valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Feb 27, 2026
…on Count (elastic#143077)

szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 27, 2026
…cations

* upstream/main:
  Warn on API key version mismatch (elastic#143127)
  Fixed wrong malformed value ordering in synthetic source tests (elastic#143187)
  [ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count (elastic#143077)
  Add configureBenchmarkLogging calls across the various benchmarks (elastic#143185)
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_aggregate_metric_double_implicit_casting} elastic#143292
  Give system role permission to invoke shard refresh (elastic#143190)
  Mute testSyntheticSourceWithTranslogSnapshot (elastic#143260)
  Adds ResumeInfo Tests (elastic#142769)
  Use a static method to configure benchmark logging (elastic#143056)
  add connectors release notes (elastic#142884)
  Add CI triage guidance for AI agents (elastic#142994)
  ESQL: Data sources: ZSTD, BZIP2 (elastic#143228)
  [ES|QL] Channels issue when an agg is called with the same field (elastic#142180) (elastic#142269)
  Add support for project routing in reindex requests (elastic#142240)
elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…on Count (#143077) (#143295)

elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…on Count (#143077) (#143294)

tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
…on Count (elastic#143077)


Labels

auto-backport (Automatically create backport pull requests when merged), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v9.2.7, v9.3.2, v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] Trained model size calculated incorrectly

3 participants