Skip to content

stdenv: reintroduce limiting by system load#192799

Draft
vcunat wants to merge 1 commit intoNixOS:stagingfrom
vcunat:p/stdenv-limit-load
Draft

stdenv: reintroduce limiting by system load#192799
vcunat wants to merge 1 commit intoNixOS:stagingfrom
vcunat:p/stdenv-limit-load

Conversation

@vcunat
Copy link
Copy Markdown
Member

@vcunat vcunat commented Sep 24, 2022

Commit c2b898d removed the mechanism; now it's replaced by a different one with higher limit of $(nproc) * 2.

Unfortunately there's no good mechanism how to override it from outside like --cores or --max-jobs, but at least the new limit should be high enough to avoid the previous downsides.

Commit c2b898d removed the mechanism; now it's replaced
by a different one with higher limit of $(nproc) * 2.

Unfortunately there's no good mechanism how to override it from outside
like `--cores` or `--max-jobs`, but at least the new limit should be
high enough to avoid the previous downsides.
@vcunat vcunat added the 9.needs: community feedback This needs feedback from more community members. label Sep 24, 2022
@github-actions github-actions bot added 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: stdenv Standard environment 6.topic: TeX Issues regarding texlive and TeX in general labels Sep 24, 2022
@vcunat
Copy link
Copy Markdown
Member Author

vcunat commented Sep 24, 2022

I'm opening this mainly for discussion for now. I believe that at least some load limit is better than none.

I was testing this on a 24-core machine to observe the behavior. Using $(nproc) * 3/2 was still leaving some periods where there was a noticeable under-utilization of CPU (by atop) due to the oscillation – though not large amount and not with all builds (llvm builds were doing this but not linux builds).

So I chose $(nproc) * 2 to improve on this. Also weakening the limit seemed more in line with my perception of community's reaction around the PR #192447.

@vcunat vcunat mentioned this pull request Sep 24, 2022
13 tasks
@jonringer
Copy link
Copy Markdown
Contributor

Ideally, I think should have this support on the nix side. Similar to how you can configure jobs and cores.

That being said, I think this is better than what we had.

@vcunat
Copy link
Copy Markdown
Member Author

vcunat commented Sep 24, 2022

Yes, adding a similar config to nix to populate NIX_LOAD_LIMIT could be done later, if this route gets consensus (in nixpkgs first). Now it doesn't take so much time for nix changes to get into the default version in nixpkgs.

@ck3d
Copy link
Copy Markdown
Contributor

ck3d commented Sep 24, 2022

For my setup a load limit twice the cores is too high. It is more likely that my machines run out of memory or starting to swap.
I think we have to introduce NIX_LOAD_LIMIT first (see NixOS/nix#6855) or set the value to a more conservative value, e.g. NIX_LOAD_LIMIT=$(ncores).

@jonringer
Copy link
Copy Markdown
Contributor

For my setup a load limit twice the cores is too high. It is more likely that my machines run out of memory or starting to swap.

Yea, there's a big difference between a 2-core CI machine, and something like my 128 vCPU server :/

@vcunat
Copy link
Copy Markdown
Member Author

vcunat commented Sep 24, 2022

A good default for everyone seems impossible with the mechanisms at hand. Generally I believe it helps to set zramSwap.enable = true;. (stalls due to zram-swapping will increase load, too)

@jonringer
Copy link
Copy Markdown
Contributor

I created NixOS/nix#7091 to make a longer term solution viable.

@ofborg ofborg bot added 10.rebuild-darwin-stdenv This PR causes stdenv to rebuild on Darwin and must target a staging branch. 10.rebuild-linux-stdenv This PR causes stdenv to rebuild on Linux and must target a staging branch. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-darwin: 5001+ This PR causes many rebuilds on Darwin and must target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. labels Sep 24, 2022
@grahamc
Copy link
Copy Markdown
Member

grahamc commented Sep 25, 2022

Yea, there's a big difference between a 2-core CI machine, and something like my 128 vCPU server :/

Note that most machines in Hydra have between 24 and 80 cores, but have between 2 and 80 cores allocated to the build. I agree there isn't a heuristic here that will be robust across hardware configurations.

@lheckemann
Copy link
Copy Markdown
Member

People who have RAM issues still have the option of reducing cores, and I'd argue that that's more appropriate. Load doesn't rise immediately, so it's likely that there's a burst with lots of processes at the beginning of a build (observed on hydra before the removal of the load limit). This can easily exhaust RAM too.

@jonringer
Copy link
Copy Markdown
Contributor

Yea, looked into what load actually provides, and I don't think it's as useful for burst workloads :(.

@vcunat
Copy link
Copy Markdown
Member Author

vcunat commented Sep 27, 2022

But by reducing --cores you slow down big builds, and many other builds are single-core. It's hard to find a good fixed split, unless you can be patient like Hydra (--cores 2 for almost all jobs).

@wegank wegank added 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 2.status: merge conflict This PR has merge conflicts with the target branch labels Mar 19, 2024
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 20, 2024
@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 4, 2024
@emilazy
Copy link
Copy Markdown
Member

emilazy commented Jul 21, 2024

Linking #328677 for those subscribed to this.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 21, 2024
@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2.status: merge conflict This PR has merge conflicts with the target branch 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: stdenv Standard environment 6.topic: TeX Issues regarding texlive and TeX in general 9.needs: community feedback This needs feedback from more community members. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-darwin: 5001+ This PR causes many rebuilds on Darwin and must target the staging branches. 10.rebuild-darwin-stdenv This PR causes stdenv to rebuild on Darwin and must target a staging branch. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. 10.rebuild-linux-stdenv This PR causes stdenv to rebuild on Linux and must target a staging branch.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants