
[ML] Adjust memory overhead for PyTorch models #86416

Merged

droberts195 merged 2 commits into elastic:master from droberts195:adjust_pytorch_model_memory_requirement on May 4, 2022
Conversation

@droberts195

This PR fixes a bug where the native code overhead that we assume is
needed to load shared libraries was not considered if a PyTorch model
was the first ML process to load on a node.

Because that overhead _was_ considered after the model had been
assigned to the node, this could have caused inconsistencies such as
the total assigned memory being greater than permitted, or repeated
autoscaling loops.

There is a complication though, as we want it to be possible to
load a good selection of PyTorch models on 2GB ML nodes in
Cloud. Many models are very close to the limit of what's
possible, and wouldn't fit if the extra 30MB was added in
without changing anything else. (This is particularly the case
when logging and metrics collection is enabled, because then
memory gets set aside for Filebeat and Metricbeat.)

To avoid drastically reducing the selection of models that will fit
on a 2GB node, the per-process overhead is reduced from 270MB to
240MB. Therefore, when there is just one model running per node, this
PR has no effect on which models will fit. When multiple models run
on the same node the memory requirement is slightly reduced compared
to before. However, the 270MB was a pretty rough estimate in the
first place, so this is unlikely to be a major problem.
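For illustration, here is a minimal sketch of how the adjusted constants combine into a per-node memory requirement. This is not the actual Elasticsearch code; the class and method names are hypothetical, and only the values (30MB native code overhead, 240MB per-process overhead, previously 270MB) come from the description above.

```java
import java.util.List;

// Hypothetical sketch of the per-node ML native memory calculation described
// above; names are illustrative, only the constants come from this PR.
public final class MlNativeMemoryEstimate {

    private static final long MB = 1024L * 1024L;

    // Overhead for loading the shared native libraries, counted once per node.
    private static final long NATIVE_CODE_OVERHEAD = 30 * MB;

    // Per-process overhead, reduced from 270MB to 240MB so that a single
    // PyTorch model still fits on a 2GB ML node once the native code
    // overhead is counted up front.
    private static final long PER_PROCESS_OVERHEAD = 240 * MB;

    private MlNativeMemoryEstimate() {}

    // Memory to reserve on a node for the given model sizes. Before this PR
    // the native code overhead was omitted when a PyTorch model was the first
    // ML process on the node, so an assignment could exceed the node's limit.
    static long requiredBytes(List<Long> modelSizesInBytes) {
        long total = NATIVE_CODE_OVERHEAD;
        for (long modelBytes : modelSizesInBytes) {
            total += modelBytes + PER_PROCESS_OVERHEAD;
        }
        return total;
    }
}
```

Under these assumptions, a single model per node reserves the same amount as before (30MB once plus 240MB per process equals the old 270MB per-process figure), so only nodes running multiple models see a slightly lower requirement.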

droberts195 added the >bug, :ml Machine learning, and v8.2.1 labels on May 4, 2022
elasticmachine added the Team:ML (Meta label for the ML team) label on May 4, 2022
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine (Collaborator)

Hi @droberts195, I've created a changelog YAML for you.

@dimitris-athanasiou (Contributor) left a comment

LGTM

droberts195 merged commit cb70d00 into elastic:master on May 4, 2022
droberts195 deleted the adjust_pytorch_model_memory_requirement branch on May 4, 2022 10:23
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request May 4, 2022
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.2

elasticsearchmachine pushed a commit that referenced this pull request May 4, 2022
* [ML] Adjust memory overhead for PyTorch models (#86416)

* Fix test
valeriy42 added a commit that referenced this pull request Nov 7, 2023
…in the model assignment planner" (#101853)

The original PR #98874 missed the memory overhead adjustment from #86416. As it caused some BWC test failures on the CI, I reverted it in #101834.

This PR reintegrates the functionality and extends the BWC integration test so that the expected memory constant depends on the version of the old cluster.
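As a hedged illustration only (not the actual test code; the class, method, and parameter names are hypothetical), a version-dependent choice of the expected constant might look roughly like this, with the 8.2.1 cutoff taken from this PR's version labels:

```java
// Hypothetical sketch of how a BWC integration test could pick the expected
// per-process overhead based on the old cluster's version.
public final class ExpectedMlOverhead {

    private static final long MB = 1024L * 1024L;

    private ExpectedMlOverhead() {}

    static long expectedPerProcessOverheadBytes(int major, int minor, int patch) {
        // Clusters older than 8.2.1 still report the original 270MB figure;
        // 8.2.1 and later use the 240MB figure introduced by #86416.
        boolean beforeAdjustment = major < 8
            || (major == 8 && minor < 2)
            || (major == 8 && minor == 2 && patch < 1);
        return (beforeAdjustment ? 270 : 240) * MB;
    }
}
```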

Labels

>bug, :ml Machine learning, Team:ML (Meta label for the ML team), v8.2.1, v8.3.0
