Description
The previous non-rsvd max/limit_in_bytes does not account for reserved
huge page memory, making it possible for a process to reserve all the
huge page memory without being able to allocate it (due to cgroup
restrictions).
In practice this makes it possible to successfully mmap more huge page
memory than allowed via the cgroup settings, but when the process touches
the memory it gets a SIGBUS and crashes. This is bad for applications
that mmap at startup: the mmap succeeds, but the program crashes later,
when it starts to use the memory. PostgreSQL, for example, does this by default.
This has led to strange segfaults like these: patroni/patroni#1393
More info can be found here: https://lkml.org/lkml/2020/2/3/1153
To solve this, I think we have two main ways to do it:
- Add writes (when supported) to rsvd for the current `hugepageLimits` found here. Silently ignore when rsvd is not supported.
- Add another element called something like `hugepageLimitsRsvd` to enforce the rsvd value; silently fail or return an error when rsvd is not supported.
I lean toward the first approach, since adding a new item makes it harder to understand and may lead to "bad" implementations, but I am not sure at all. The advantage of the second approach, with a separate entity, is that it is then up to the user of the runtime to decide which limit to enforce, giving the "user" full choice, even though I see no real reason to enforce the "old" value and not the reserved one. The current behavior lets a cgroup-limited process reserve all the huge page memory available on a node, making it inaccessible to others.
Whatever the decision, we should then update the config-linux.md docs to clarify how it should work.
Any thoughts?
A simple WIP in runc that adds support for enforcing it via `hugepageLimits` is here: https://github.com/opencontainers/runc/pull/2360/files