[ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count #143077
valeriy42 merged 9 commits into elastic:main
Conversation
Hi @valeriy42, I've created a changelog YAML for you.
```java
TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.getOrDefault(
    deploymentId,
    modelSizeStatsMap.get(modelId)
);
```
Previously, the builder always looked up model size stats by model ID only. Now it does:
- First: try the deployment ID (for deployed PyTorch models, where stats are per deployment).
- Fallback: use the model ID (for undeployed or non-PyTorch models).
So for a model with multiple deployments, you get the correct per-deployment required_native_memory_bytes instead of a single value shared across the whole model.
I think the nested modelSizeStatsMap.get calls look a bit strange. And it's a bit inefficient: the second lookup is always done.

I'd prefer:

```java
TrainedModelSizeStats modelSizeStats = modelSizeStatsMap.get(deploymentId);
if (modelSizeStats == null) {
    modelSizeStats = modelSizeStatsMap.get(modelId);
}
```
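The efficiency point is worth spelling out: `Map.getOrDefault` is a plain method call, so its second argument is evaluated before the call, meaning the fallback `get(modelId)` runs even when the deployment ID hits. A minimal, self-contained sketch (with a hypothetical `SizeStats` record standing in for `TrainedModelSizeStats`) contrasting the two forms:

```java
import java.util.HashMap;
import java.util.Map;

class LookupSketch {
    // Hypothetical stand-in for TrainedModelSizeStats; only the origin label matters here.
    record SizeStats(String origin) {}

    // getOrDefault form: the fallback map.get(modelId) is evaluated unconditionally,
    // even when deploymentId is present in the map.
    static SizeStats eagerLookup(Map<String, SizeStats> map, String deploymentId, String modelId) {
        return map.getOrDefault(deploymentId, map.get(modelId));
    }

    // The reviewer's preferred form: the second lookup runs only on a miss.
    static SizeStats lazyLookup(Map<String, SizeStats> map, String deploymentId, String modelId) {
        SizeStats stats = map.get(deploymentId);
        if (stats == null) {
            stats = map.get(modelId);
        }
        return stats;
    }
}
```

Both forms return the same result; the difference is only that the explicit null check skips the second lookup when the deployment ID is found.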
and add your PR comment to the code please
```diff
  parentTaskId,
  l,
- numberOfAllocations
+ deploymentStatsByDeploymentId
```
Key part: instead of passing a single summed numberOfAllocations into modelSizeStats(), it now passes a Map<String, AssignmentStats> with per-deployment stats.
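The shape of the change can be sketched as follows. This is a simplified, hypothetical model: the `AssignmentStats` record, the constants, and the linear memory formula are illustrative stand-ins, not the actual estimator in the ML plugin. What it shows is the structure the PR describes: one size-stats entry per deployment, keyed by deployment ID, each computed from that deployment's own allocation count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PerDeploymentSketch {
    // Hypothetical, simplified stand-in for the real AssignmentStats.
    record AssignmentStats(int numberOfAllocations) {}

    // Assumed constants for illustration only.
    static final long BASE_BYTES = 240L * 1024 * 1024;
    static final long PER_ALLOCATION_BYTES = 50L * 1024 * 1024;

    // Emits one required-memory figure per deployment, keyed by deploymentId,
    // using that deployment's own allocation count rather than a sum across deployments.
    static Map<String, Long> modelSizeStats(Map<String, AssignmentStats> statsByDeploymentId) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (var e : statsByDeploymentId.entrySet()) {
            long required = BASE_BYTES + PER_ALLOCATION_BYTES * e.getValue().numberOfAllocations();
            result.put(e.getKey(), required);
        }
        return result;
    }
}
```

With this shape, two deployments of the same model with different allocation counts get different `required_native_memory_bytes` values, which is exactly the behavior the fix restores.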
Pinging @elastic/ml-core (Team:ML)
Two outdated review threads on ...ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetTrainedModelsStatsAction.java were resolved.
…on Count (elastic#143077)

The trained model stats API was computing `required_native_memory_bytes` incorrectly. Since elastic#98139, memory estimation has depended on the number of allocations, but the code used the total allocation count across all deployments instead of each deployment's own count. As a result, when multiple NLP models were deployed with different allocation counts, every model's `required_native_memory_bytes` was based on the same summed value, so changing allocations for one model incorrectly changed the reported memory for others. This only affected the Stats API output, not actual deployment behavior.

The fix computes `required_native_memory_bytes` per deployment using each deployment's allocation count. `TransportGetTrainedModelsStatsAction` now passes `Map<String, AssignmentStats>` into `modelSizeStats()` instead of a summed allocation count, and `modelSizeStats()` emits per-deployment entries keyed by `deploymentId` with the correct `numberOfAllocations`. `GetTrainedModelsStatsAction.Response.Builder.build()` looks up model size stats by `deploymentId` first and falls back to `modelId` for undeployed or non-PyTorch models. Unit tests were added to cover per-deployment resolution, undeployed models, and the fallback path.

Fixes elastic#107831
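The bug described above can be illustrated numerically. This sketch assumes a simple linear memory model (the real estimator in the ML plugin is more involved); `requiredBytes` and `buggyEstimate` are hypothetical names. With deployments of 1 and 4 allocations, the pre-fix code reported one figure derived from the sum (5) for every deployment, while the fixed code reports a distinct figure per deployment:

```java
class AllocationCountSketch {
    // Assumed linear model for illustration: fixed base plus a per-allocation cost.
    static long requiredBytes(int numberOfAllocations) {
        return 100_000_000L + 10_000_000L * (long) numberOfAllocations;
    }

    // Pre-fix behavior: a single figure, computed from the allocation total
    // across all deployments, was reported for every deployment of the model.
    static long buggyEstimate(int... allocationsPerDeployment) {
        int sum = 0;
        for (int a : allocationsPerDeployment) {
            sum += a;
        }
        return requiredBytes(sum);
    }
}
```

Under these assumed numbers the buggy path reports 150 MB for both deployments, while the per-deployment path reports 110 MB and 140 MB, which also explains the reported symptom: scaling one deployment's allocations changed the figure shown for the others.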
…cations

* upstream/main:
  - Warn on API key version mismatch (elastic#143127)
  - Fixed wrong malformed value ordering in synthetic source tests (elastic#143187)
  - [ML] Fix: required_native_memory_bytes Calculated with Wrong Allocation Count (elastic#143077)
  - Add configureBenchmarkLogging calls across the various benchmarks (elastic#143185)
  - Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_aggregate_metric_double_implicit_casting} elastic#143292
  - Give system role permission to invoke shard refresh (elastic#143190)
  - Mute testSyntheticSourceWithTranslogSnapshot (elastic#143260)
  - Adds ResumeInfo Tests (elastic#142769)
  - Use a static method to configure benchmark logging (elastic#143056)
  - add connectors release notes (elastic#142884)
  - Add CI triage guidance for AI agents (elastic#142994)
  - ESQL: Data sources: ZSTD, BZIP2 (elastic#143228)
  - [ES|QL] Channels issue when an agg is called with the same field (elastic#142180) (elastic#142269)
  - Add support for project routing in reindex requests (elastic#142240)
…on Count (#143077) (#143295)
…on Count (#143077) (#143294)