
[8.2] [ML] Adjust memory overhead for PyTorch models (#86416) #86419

Merged
elasticsearchmachine merged 2 commits into elastic:8.2 from
droberts195:backport/8.2/pr-86416
May 4, 2022

Conversation

@droberts195

Backports the following commits to 8.2:

- [ML] Adjust memory overhead for PyTorch models (#86416)

This PR fixes a bug where the native code overhead (the memory
we assume is needed to load shared libraries) was not counted
if a PyTorch model was the first ML process to load on a node.

Because that overhead _was_ counted after the model had been
assigned to the node, this could have caused inconsistencies
such as total assigned memory being greater than permitted,
or repeated autoscaling loops.

There is a complication, though: we want it to be possible to
load a good selection of PyTorch models on 2GB ML nodes in
Cloud. Many models are very close to the limit of what's
possible, and wouldn't fit if the extra 30MB of native code
overhead were added without changing anything else. (This is
particularly the case when logging and metrics collection are
enabled, because then memory gets set aside for Filebeat and
Metricbeat.)

To avoid drastically reducing the selection of models that will
fit on a 2GB node, the per-process overhead is reduced from 270MB
to 240MB. The 30MB reduction offsets the 30MB of native code
overhead that is now counted, so when there is just one model
running per node this PR has no effect on which models will fit.
When multiple models run on the same node, the memory requirement
is slightly lower than before, because the reduction applies to
every model while the native code overhead is counted only once.
However, the 270MB was a pretty rough estimate in the first place,
so this is unlikely to be a major problem.
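
As a rough illustration of the arithmetic above, here is a minimal
Python sketch. It is not the Elasticsearch implementation: the
function `required_memory`, its parameters, and the example model
sizes are hypothetical, and it assumes a node's requirement is the
sum of each model's memory plus the per-process overhead, with the
native code overhead counted once per node. Only the 30MB, 270MB
and 240MB figures come from the description.

```python
MB = 1024 * 1024

# Figures taken from the PR description.
NATIVE_CODE_OVERHEAD = 30 * MB       # shared libraries, counted once per node
OLD_PER_PROCESS_OVERHEAD = 270 * MB  # before this PR
NEW_PER_PROCESS_OVERHEAD = 240 * MB  # after this PR


def required_memory(model_sizes, per_process_overhead, include_native_overhead=True):
    """Hypothetical estimate of memory needed for the models on one node.

    model_sizes is a list of per-model memory figures in bytes (made-up
    inputs, not real model measurements).
    """
    total = sum(size + per_process_overhead for size in model_sizes)
    if model_sizes and include_native_overhead:
        total += NATIVE_CODE_OVERHEAD
    return total


# One model: the old check (which skipped the native overhead for the
# first process) and the new check need the same amount of memory.
one_model = [500 * MB]
old_single = required_memory(one_model, OLD_PER_PROCESS_OVERHEAD,
                             include_native_overhead=False)
new_single = required_memory(one_model, NEW_PER_PROCESS_OVERHEAD)

# Two models: the new requirement is 30MB lower per model.
two_models = [500 * MB, 300 * MB]
old_double = required_memory(two_models, OLD_PER_PROCESS_OVERHEAD)
new_double = required_memory(two_models, NEW_PER_PROCESS_OVERHEAD)

print(old_single == new_single)
print((old_double - new_double) // MB)
```

Under these assumptions the single-model totals match, and each
additional model lowers the requirement by a further 30MB relative
to the old figure.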
@droberts195 added the :ml, >bug, auto-merge-without-approval, backport and Team:ML labels on May 4, 2022
@elasticmachine

Pinging @elastic/ml-core (Team:ML)


Labels

auto-merge-without-approval, backport, >bug, :ml, Team:ML, v8.2.1
