Skip to content

Conversation

@polarathene
Copy link
Contributor

@polarathene polarathene commented May 13, 2023

What I did

Adjusted init configs to be as consistent as possible with handling the RLIMIT_NOFILE setting (number of open file descriptors a single process can have at one time).

  • An explicit soft limit + constrained hard limit prevents a variety of issues that occur when software is run in a container vs directly on the host.
    • In the past, a soft limit may have been problematic for dockerd itself (_non-issue since Go 1.19 / Aug 2022).
    • LimitNOFILE=infinity is a regression introduced in a May 2021 PR merging a 2018 commit from another repo, whereas the value of infinity became excessive on various systems starting from 2019.
    • 524288 is a standard hard limit on systems with systemd since v240, being sufficient for software and not known to cause any problems.
  • NOTE: The OpenRC config uses a single command for ulimit via rc_ulimit, preventing setting a soft and hard limit, thus it will maintain a high soft limit. You cannot use -H + -S in the same ulimit command to set separate soft/hard values.
Reasoning for `1024:524288` (click to expand)
  • infinity is especially problematic as a soft limit, introducing difficult to troubleshoot / discover bugs.
    • This became more problematic with systemd v240 & Go 1.19 releases adjusting the impact implicitly.
  • 1048576 was adjusted to 524288 elsewhere to align with docker.service / systemd.
    • The few that actually need this to be higher can use drop-in overrides. However that is likely more relevant for containerd.service than it is docker.service.
  • I have documented how LimitNOFILE in docker.service + containerd.service affects resource usage and operations:
    • 524288 should be more than enough to support 65k busybox containers (which itself would encounter various other bottlenecks along the way that require configuration changes to support).
    • Individual containers can request higher limits with --ulimit "nofile=soft:hard" (as root no greater than fs.nr_open).
  • Extensive reasoning for choosing 1024:524288 covered here.
  • systemd docs for LimitNOFILE:

    LimitNOFILE: Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux.
    Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults

    Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, (i.e. do not use select(2)).
    Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit.

How I did it

Approach (click to expand):
  • Used git blame history on config to lookup relevant PRs on Github, review the changes between them and discussions.
  • Engaged in main discussions on moby and containerd regarding the issue.
  • Researched external resources on the subject and investigated behaviour across several releases and distros where changes from Docker, systemd and Go, or a distro affected the behaviour. Shared findings.
  • Consolidated all of the information gathered to propose a positive improvement with evidence to support it.
Related History (click to expand):

More extensive history coverage here. containerd also had a similar LimitNOFILE config dance.

In addition, the ecosystem around Docker also changed. Notably changes introduced that affected what infinity resolves to (since systemd v240 in 2018Q4), and ignoring any configured soft limit (Go 1.19/1.20 releases between Aug 2022 and May 2023).


Additional references (click to expand):

How to verify it

Requires at least Moby 23.0.6, or 24.0.0 for an important fix introduced since Go 1.19.9 / 1.20.4:

FROM alpine
RUN echo "Soft: $(ulimit -Sn)" >> /limits.txt
RUN echo "Hard: $(ulimit -Hn)" >> /limits.txt
# `docker.service` limit applies to `docker build` since the adoption of BuildKit:
$ docker build --no-cache -t limits-test .
$ docker run --rm limits-test cat /limits.txt

Soft: 1024
Hard: 524288
Notes (click to expand)
  • Without this change, both Soft and Hard values would be the same and either 1048576 (2^20) or 1073741816 (2^30).

  • grep 'files' /proc/$(pidof dockerd)/limits will report the soft limit as the hard limit.

    This is expected because Go implicitly raises the soft limit since Go 1.19 released in Aug 2022. That is reset to the expected soft limit when the daemon runs some processes involved in docker build.

  • Testing at runtime with containers using docker run instead of via image builds requires a related change in containerd (min version required: 1.6.21 / 1.7.1). That can be verified as detailed here.

  • Another reproduction on limits that may be of interest.

Description for the changelog

Adjusted RLIMIT_NOFILE to more sensible defaults.

A picture of a cute animal (not mandatory but encouraged)

zootopia

@corhere
Copy link
Contributor

corhere commented May 16, 2023

Any reason not to drop the custom NOFILE limit configuration entirely and simply use the system's default, at least for the systemd unit?

@polarathene
Copy link
Contributor Author

polarathene commented May 17, 2023

@corhere If you're sure that all supported distros will be compatible with doing so, removing the line should be ok 👍 (CentOS 7 is cited as still supported earlier this year, which is stuck on an old systemd release prior to 240)

Alternatively, older supported distros, could be advised to use a drop-in override?

NOTE: contrib/init/sysvinit-redhat had no limit configured. Not sure if that was intentional, or if it's only on systems where the hard limit is high enough? It seemed ignored from this PR. Potentially no users with that init had workloads of hundred of containers. Especially today, most are probably on systemd.


Concerns - Verbose (click to expand)

Concern - May be preferable to keep LimitNOFILE for now

I don't think this LimitNOFILE line would cause any problems/surprises:

  • It's mostly redundant (since it should be the default on systemd v240+), but is somewhat informative to those not familiar with it, and would be consistent with the other supported init configs adjusting this limit.
  • There may be some value in git blame history, which is less obvious to track when a diff for a file is only deleting lines.

I wouldn't expect the default to change in future. Nor are we likely to have someone attempt to contribute LimitNOFILE=infinity again as I think maintainers are well aware of this issue now.

However users may find advice to set LimitNOFILE, as no doubt historical search results probably land on some problems / discussions that advise changing it to some value (usually without a soft value, as that wasn't compatible in the past), or existing users that already have a habit to modify/set it? The new comments in this PR may help better inform those users to keep the soft limit (avoiding some potential bugs).

Maybe keep the line around for a year or so before dropping it?:

  • Users / online resources have some time to be exposed to the change (vs missing line and adding their own "just to be safe")
  • Older distros with versions of systemd older than 240 may no longer need to be supported by moby?

Concern - Dropping LimitNOFILE for supported distros with systemd <240

A concern that was raised back in March was needing to still support CentOS 7, which I responded with a link to a Sep 2022 article:

In issues/23109, systemd explicitly states that CentOS 7 is too old and may not be upgraded to a higher version of systemd.

Prior to systemd v240, the default hard limit apparently varied by distro. I explored this a bit, if no hard limit is present for the service, it may be as low as 4096 (kernel default). This was the case on an old Ubuntu server VM I had access to.

Originally LimitNOFILE was added in docker.service to raise the soft limit high enough for the daemon to support managing hundreds of containers:

  • Since Aug 2022 (and fix released in May 2023) Go now handles that concern implicitly. Only the hard limit needs to be high enough now.
  • Since the LimitNOFILE introduction in 2014, I think containerd later took on responsibilities that dockerd managed, and there was a comment saying this would make the original problem less likely? (for docker.service) Might have been about a similar concern with max tasks/threads actually.

@polarathene

This comment was marked as resolved.

@tianon
Copy link
Member

tianon commented May 17, 2023

#45551 👀 (let's please not waste any cycles on Upstart 😅)

@polarathene polarathene force-pushed the fix/config-init-limits branch from 893e6d5 to 5687f27 Compare May 17, 2023 23:02
@neersighted
Copy link
Member

Could you please rebase to drop the original upstart change instead of reverting it in a later commit?

@polarathene polarathene force-pushed the fix/config-init-limits branch 2 times, most recently from 3c7e689 to 674afeb Compare May 18, 2023 09:44
@thaJeztah
Copy link
Member

Any reason not to drop the custom NOFILE limit configuration entirely and simply use the system's default, at least for the systemd unit?

Discussing in our sync call; perhaps this is the way we should go, and use systemd's defaults, which are the same currently. Does that work for you, @polarathene ?

@polarathene
Copy link
Contributor Author

use systemd's defaults, which are the same currently. Does that work for you, @polarathene ?

That would be fine for me, but you mentioned in the past you still support CentOS 7 (EOL in roughly a year).

That does not have a new enough systemd version and IIRC cannot support v240+, thus you'd probably fallback to 1024:4096. That may be too low for some operations (perhaps prune with enough items to remove).


I also expanded on that concern and another earlier, click to expand the collapsed details if that's helpful.

My stance remains that it is probably be wise to go with what this PR proposes, and then at a later release remove the LimitNOFILE line, at least if there are any unexpected regressions with this change it'll be better communicated with the explicit value, before transitioning to implicit where it may differ.

@neersighted
Copy link
Member

I'm in favor of LimitNOFILE with a TODO to remove after systemd <= 240 (EL 7) falls away; I think it's past time to get this in to a release, and I'll try to make sure we get this reviewed this week.

@cpuguy83
Copy link
Member

cpuguy83 commented Jul 1, 2023

Its probably easier to remove it now and add a patch for older distros at packaging time.

@polarathene
Copy link
Contributor Author

polarathene commented Jul 2, 2023

Its probably easier to remove it now and add a patch for older distros at packaging time.

What is the advantage of keeping a separate patch around to maintain?

I believe keeping the value explicit for now (as this PR proposes) is more beneficial to everyone.


Explicit defaults ensures all distros have the sane values (that would be implicit for modern distros), while documenting the change as a bonus (easier to track via git blame too).

When the older distros are considered unsupported, you'd remove the explicit config like you would the separate patch. This approach would transition more smoothly IMO.

It also aligns with the openrc and sysvinit configs being adjusted in the PR, and provides an inline description / diff for better context/visibility of the change. As I've mentioned earlier in the discussion, this has value in better informing a casual Docker user (that has read third-party advice to change this setting, either in the future or as a past habit when they apply it again for a new install of Docker).

@neersighted neersighted mentioned this pull request Jul 13, 2023
26 tasks
@cpuguy83
Copy link
Member

What is the advantage of keeping a separate patch around to maintain?

My thought here is we'd only be keeping it for a distro version that is rapidly headed towards EOL.
I know I don't even plan to release newer versions on centos7 but even so it is trivial to carry a patch to apply an explicit value on older distros until they drop off.
Meanwhile the default works well.

@neersighted
Copy link
Member

Given this is both repository is both "upstream" of packagers (Docker Inc., Mirantis, Microsoft) and this is a contrib/ file, I suggest we just do whatever we collectively think is easiest upstream.

@corhere do you have a preference? I know Mirantis maintains their own systemd units, so maybe there are no strong opinions?

@polarathene
Copy link
Contributor Author

polarathene commented Jul 20, 2023

Meanwhile the default works well.

I'm fine with either outcome personally, both would be much better than the current LimitNOFILE=infinity.

I would like to draw attention to the benefits I raised in my previous comment of keeping the LimitNOFILE explicit. I don't see any disadvantage there, it's simpler than a potential patch, consistent and may be better for users to encounter.

None of that matters for me, but might be worth taking into consideration.


Would you like me to make any changes to this PR?

Systemd aside, this PR also normalizes all the other init configs to the same hard limit 524288 (and the only other one that seems able to set a separate soft limit back to 1024). Are those changes still wanted?

Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with maintainers.
We'd like to get this change in for 25.0 release and agreed to just remove it completely.
In packaging we'll patch the unit file as needed (pretty much just centos7).

Are you good to make those changes?

@polarathene

This comment was marked as resolved.

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't squash-merge in this repository, so if you can rewrite history into a single commit, that would be appreciated. I don't think there's a need to split changes as systemd vs non-systemd.

@cpuguy83
Copy link
Member

LGTM just needs a squash.

@polarathene polarathene force-pushed the fix/config-init-limits branch from dbf46d2 to f3c7667 Compare July 26, 2023 00:27
During review, it was decided to remove `LimitNOFILE` from `docker.service` to rely on the systemd v240 implicit default of `1024:524288`. On supported platforms with systemd prior to v240, packagers will patch the service with an explicit `LimitNOFILE=1024:524288`.

- `1024` soft limit is an implicit default, avoiding unexpected breakage. Software that needs a higher limit should request to raise the soft limit for its process.
- `524288` hard limit is an implicit default since systemd v240 and is adequate for most processes (_half of the historical limit from `fs.nr_open` of `1048576`_), while 4096 is the implicit default from the kernel (often too low). Individual containers can be started with `--ulimit` when a larger hard limit is required.
- The hard limit may not exceed `fs.nr_open` (_which a value of `infinity` will resolve to_). On most systems with systemd v240 or newer, this will resolve to an excessive size of 2^30 (over 1 billion).
- When set to `infinity` (usually as the soft limit) software may experience significantly increased resource usage, resulting in a performance regression or runtime failures that are difficult to troubleshoot.
- OpenRC current config approach lacks support for different soft/hard limits being set as it adjusts additional limits and `ulimit` does not support mixed usage of `-H` + `-S`. A soft limit of `524288` is not ideal, but 2^19 is much less overhead than 2^30, whilst a hard limit of 4096 would be problematic for Docker.

Signed-off-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
@polarathene
Copy link
Contributor Author

polarathene commented Aug 7, 2023

Friendly ping, not sure if you got notified of the commit from last month.

I've slightly revised the commit message today and rebased to current upstream. I also opened an equivalent PR for containerd here: containerd/containerd#8924

@neersighted neersighted merged commit 764419e into moby:master Aug 21, 2023
@thaJeztah thaJeztah added this to the 25.0.0 milestone Oct 26, 2023
@thaJeztah thaJeztah changed the title fix: Normalize RLIMIT_NOFILE to sensible defaults fix: Normalize RLIMIT_NOFILE (LimitNOFILE) to sensible defaults Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

7 participants