[ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count #143077

Merged
valeriy42 merged 9 commits into elastic:main from valeriy42:fix/is-107831
Feb 27, 2026

Conversation

@valeriy42
Contributor

The trained model stats API was computing required_native_memory_bytes incorrectly. Since #98139, memory estimation has depended on the number of allocations, but the code used the total allocation count across all deployments instead of each deployment’s own count. As a result, when multiple NLP models were deployed with different allocation counts, every model’s required_native_memory_bytes was based on the same summed value, so changing allocations for one model incorrectly changed the reported memory for others. This only affected the Stats API output, not actual deployment behavior.

The fix computes required_native_memory_bytes per deployment using each deployment’s allocation count. TransportGetTrainedModelsStatsAction now passes Map<String, AssignmentStats> into modelSizeStats() instead of a summed allocation count, and modelSizeStats() emits per-deployment entries keyed by deploymentId with the correct numberOfAllocations. GetTrainedModelsStatsAction.Response.Builder.build() looks up model size stats by deploymentId first and falls back to modelId for undeployed or non-PyTorch models. Unit tests were added to cover per-deployment resolution, undeployed models, and the fallback path.

Fixes #107831
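The effect of the bug can be sketched with a toy memory model (all names and constants below are hypothetical; the real estimate in the ML plugin is more involved):

```java
import java.util.Map;

public class PerDeploymentMemorySketch {
    // Illustrative constants only, not the plugin's real numbers.
    static final long BASE_BYTES = 240L;
    static final long PER_ALLOCATION_BYTES = 32L;

    // Hypothetical stand-in for the allocation-dependent estimate introduced in #98139.
    static long requiredNativeMemoryBytes(int numberOfAllocations) {
        return BASE_BYTES + PER_ALLOCATION_BYTES * numberOfAllocations;
    }

    public static void main(String[] args) {
        // Two deployments with different allocation counts.
        Map<String, Integer> allocationsByDeploymentId = Map.of("deployment-a", 2, "deployment-b", 6);

        // Buggy behavior: both deployments were reported using the summed count (2 + 6 = 8),
        // so changing one deployment's allocations changed the other's reported memory.
        int summed = allocationsByDeploymentId.values().stream().mapToInt(Integer::intValue).sum();
        long buggy = requiredNativeMemoryBytes(summed);

        // Fixed behavior: each deployment is reported with its own count.
        long fixedA = requiredNativeMemoryBytes(allocationsByDeploymentId.get("deployment-a"));
        long fixedB = requiredNativeMemoryBytes(allocationsByDeploymentId.get("deployment-b"));

        System.out.println(buggy + " " + fixedA + " " + fixedB);
    }
}
```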

@valeriy42 valeriy42 added labels on Feb 25, 2026: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), auto-backport (Automatically create backport pull requests when merged), v9.4.0, v9.3.2, v9.2.7
@elasticsearchmachine
Collaborator

Hi @valeriy42, I've created a changelog YAML for you.

Comment on lines +316 to +319
TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.getOrDefault(
deploymentId,
modelSizeStatsMap.get(modelId)
);
Contributor Author

Previously, the builder always looked up model size stats by model ID only. Now it does:

  • First: try deployment ID (for deployed PyTorch models, where stats are per deployment).
  • Fallback: use model ID (undeployed or non-PyTorch models).

So for a model with multiple deployments, you get the correct per-deployment required_native_memory_bytes instead of a single value per model.
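A minimal illustration of that lookup order, with hypothetical map contents and the value type simplified to the memory figure alone:

```java
import java.util.Map;

public class SizeStatsLookupSketch {
    public static void main(String[] args) {
        // Hypothetical contents: deployed PyTorch models are keyed by
        // deploymentId, undeployed or non-PyTorch models by modelId.
        Map<String, Long> modelSizeStatsMap = Map.of(
            "deployment-a", 304L,   // first deployment of "my-model"
            "deployment-b", 432L,   // second deployment of "my-model"
            "dfa-model", 123L       // undeployed model, keyed by modelId
        );

        // Two deployments of the same model resolve to distinct values.
        System.out.println(lookup(modelSizeStatsMap, "deployment-a", "my-model"));
        System.out.println(lookup(modelSizeStatsMap, "deployment-b", "my-model"));
        // No deployment: falls back to the modelId entry.
        System.out.println(lookup(modelSizeStatsMap, null, "dfa-model"));
    }

    // Deployment ID first (per-deployment stats), then fall back to model ID.
    static Long lookup(Map<String, Long> statsMap, String deploymentId, String modelId) {
        Long stats = deploymentId == null ? null : statsMap.get(deploymentId);
        return stats != null ? stats : statsMap.get(modelId);
    }
}
```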

Contributor


I think the nested modelSizeStatsMap.gets look a bit strange. And it's a bit inefficient: the second lookup is always done.

I'd prefer:

  TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.get(deploymentId);
  if (modelSizeStats == null) {
      modelSizeStats = modelSizeStatsMap.get(modelId);
  }

Contributor


and add your PR comment to the code please

  parentTaskId,
  l,
- numberOfAllocations
+ deploymentStatsByDeploymentId
Contributor Author


Key part: instead of passing a single summed numberOfAllocations into modelSizeStats(), it now passes a Map<String, AssignmentStats> with per-deployment stats.
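A simplified sketch of the new shape (the real classes are reduced to the fields that matter here, and the estimate formula is made up):

```java
import java.util.HashMap;
import java.util.Map;

public class ModelSizeStatsSketch {
    // Stand-ins for the real AssignmentStats and TrainedModelSizeStats classes.
    record AssignmentStats(String modelId, int numberOfAllocations) {}
    record TrainedModelSizeStats(long requiredNativeMemoryBytes) {}

    // Hypothetical allocation-dependent memory estimate.
    static long estimate(int numberOfAllocations) {
        return 240L + 32L * numberOfAllocations;
    }

    // Emits one entry per deploymentId, each computed from that deployment's
    // own allocation count, instead of one entry per modelId built from a summed count.
    static Map<String, TrainedModelSizeStats> modelSizeStats(Map<String, AssignmentStats> statsByDeploymentId) {
        Map<String, TrainedModelSizeStats> result = new HashMap<>();
        statsByDeploymentId.forEach((deploymentId, stats) ->
            result.put(deploymentId, new TrainedModelSizeStats(estimate(stats.numberOfAllocations()))));
        return result;
    }
}
```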

@valeriy42 valeriy42 marked this pull request as ready for review February 27, 2026 14:15
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Contributor

@jan-elastic jan-elastic left a comment


LGTM


@valeriy42 valeriy42 merged commit bbb8dd5 into elastic:main Feb 27, 2026
33 of 35 checks passed
@valeriy42 valeriy42 deleted the fix/is-107831 branch February 27, 2026 18:29
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Feb 27, 2026
…on Count (elastic#143077)

@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 9.3, 9.2

valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Feb 27, 2026
…on Count (elastic#143077)

szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 27, 2026
…cations

* upstream/main:
  Warn on API key version mismatch (elastic#143127)
  Fixed wrong malformed value ordering in synthetic source tests (elastic#143187)
  [ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count (elastic#143077)
  Add configureBenchmarkLogging calls across the various benchmarks (elastic#143185)
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_aggregate_metric_double_implicit_casting} elastic#143292
  Give system role permission to invoke shard refresh (elastic#143190)
  Mute testSyntheticSourceWithTranslogSnapshot (elastic#143260)
  Adds ResumeInfo Tests (elastic#142769)
  Use a static method to configure benchmark logging (elastic#143056)
  add connectors release notes (elastic#142884)
  Add CI triage guidance for AI agents (elastic#142994)
  ESQL: Data sources: ZSTD, BZIP2 (elastic#143228)
  [ES|QL] Channels issue when an agg is called with the same field (elastic#142180) (elastic#142269)
  Add support for project routing in reindex requests (elastic#142240)
elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…on Count (#143077) (#143295)

elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2026
…on Count (#143077) (#143294)

tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
…on Count (elastic#143077)


Labels

auto-backport (Automatically create backport pull requests when merged), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v9.2.7, v9.3.2, v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] Trained model size calculated incorrectly

3 participants