[ML] Adjust memory overhead for PyTorch models #86416
Merged
droberts195 merged 2 commits into elastic:master on May 4, 2022
Conversation
This PR fixes a bug where the native code overhead that we assume for the memory needed to load shared libraries was not considered if a PyTorch model was the first ML process to load on a node. Because that overhead _was_ considered after the model had been assigned to the node, this could have caused inconsistencies such as total assigned memory being greater than permitted, or repeated autoscaling loops.

There is a complication though, as we want it to be possible to load a good selection of PyTorch models on 2GB ML nodes in Cloud. Many models are very close to the limit of what's possible, and wouldn't fit if the extra 30MB was added in without changing anything else. (This is particularly the case when logging and metrics collection is enabled, because then memory gets set aside for Filebeat and Metricbeat.)

To avoid drastically reducing the selection of models that will fit on a 2GB node, the per-process overhead is reduced from 270MB to 240MB. Therefore, when there is just one model running per node, this PR has no effect on which models will fit. When multiple models run on the same node, the memory requirement is slightly reduced compared to before. However, the 270MB was a pretty rough estimate in the first place, so this is unlikely to be a major problem.
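To make the before/after accounting concrete, here is a minimal sketch of the intended logic. The class, method, and constant names are invented for illustration and are not the identifiers used in the Elasticsearch code; the only figures taken from the description above are the 30MB native code overhead, the new 240MB per-process overhead, and the old 270MB value.

```java
// Illustrative sketch only: names are hypothetical, not Elasticsearch's own.
public class ModelMemoryEstimate {

    // Overhead assumed for loading native shared libraries, charged once per
    // node when the first ML process starts (30MB per the PR description).
    static final long NATIVE_CODE_OVERHEAD_BYTES = 30L * 1024 * 1024;

    // Per-process overhead, reduced from 270MB to 240MB by this PR.
    static final long PER_PROCESS_OVERHEAD_BYTES = 240L * 1024 * 1024;

    /**
     * Memory required to start a PyTorch deployment on a node.
     *
     * The bug: when the deployment was the first ML process on the node, the
     * native code overhead was left out of this estimate at assignment time
     * but charged once the model was running, so the node could end up
     * over-committed or autoscaling could loop.
     */
    static long requiredBytes(long modelSizeBytes, boolean firstMlProcessOnNode) {
        long required = modelSizeBytes + PER_PROCESS_OVERHEAD_BYTES;
        if (firstMlProcessOnNode) {
            required += NATIVE_CODE_OVERHEAD_BYTES;
        }
        return required;
    }

    public static void main(String[] args) {
        long model = 500L * 1024 * 1024; // a hypothetical 500MB model
        // First model on the node: 500 + 240 + 30 = 770MB, the same total a
        // single model needed before (500 + 270), so nothing stops fitting.
        System.out.println(requiredBytes(model, true) / (1024 * 1024) + "MB");
        // Each additional model: 500 + 240 = 740MB, i.e. 30MB less than with
        // the old 270MB per-process figure.
        System.out.println(requiredBytes(model, false) / (1024 * 1024) + "MB");
    }
}
```

With these figures, the first deployment on a node needs the same total as a single model did before (240MB + 30MB of overhead), while each additional deployment needs 30MB less.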
Collaborator
Pinging @elastic/ml-core (Team:ML)
Collaborator
Hi @droberts195, I've created a changelog YAML for you.
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request on May 4, 2022
Collaborator
💚 Backport successful
elasticsearchmachine pushed a commit that referenced this pull request on May 4, 2022
* [ML] Adjust memory overhead for PyTorch models (#86416)
* Fix test
valeriy42 added a commit that referenced this pull request on Nov 7, 2023
…in the model assignment planner" (#101853) The original PR #98874 missed the memory overhead adjustment from #86416. As it caused some BWC test failures in CI, I reverted it in #101834. This PR reintegrates the functionality and extends the BWC integration test so that the memory constant depends on the version of the old cluster.