Skip to content

ci: support variable number of CPUs for building#39701

Merged
phlax merged 1 commit intoenvoyproxy:mainfrom
leonematt:build_dev
Aug 20, 2025
Merged

ci: support variable number of CPUs for building#39701
phlax merged 1 commit intoenvoyproxy:mainfrom
leonematt:build_dev

Conversation

@leonematt
Copy link
Copy Markdown
Contributor

@leonematt leonematt commented Jun 1, 2025

This pr updates the documentation to inform new developers how to troubleshoot crashes when building

@repokitteh-read-only
Copy link
Copy Markdown

Hi @leonematt, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #39701 was opened by leonematt.

see: more, trace.

Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for raising this

i see a couple of issues - one is that jobs means different things for a remote or local build

the other is that this is more of a bazel problem - and so should probs be fixed at that level
perhaps the best solution is just to put your desired number of jobs in your user.bazelrc

calculating the correct num of jobs for all situations is near impossible because it deps on ratio cpu/mem and what it is you are doing

@phlax phlax self-assigned this Jun 1, 2025
@leonematt
Copy link
Copy Markdown
Contributor Author

leonematt commented Jun 2, 2025

Thanks for the response and I have a few quick questions:

What is the difference between a job running on a remote vs local machine?

I see that NUM_CPUS is already used as a variable in the build script. Since NUM_CPUS is already calculated, would connecting it to job control improve the contributor experience? I am still new to bazel so its good that I am figuring this stuff out, but I imagine it would be helpful to make sure that users that want to contribute have the least amount of friction towards contributing and the j flag is common across popular projects that use make, cmake, etc.

Do you think it would be better to add to the documentation that you can change the the bazelrc files in case the developer runs out of memory, or maybe even implement an optimizer that figures out what the limiting factor is for building, either CPU or Memory, and then picking the number of CPUs off that comparison?

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 2, 2025

in a local setup jobs=num of cpus

in remote jobs=num of workers (irrespective of their core count)

I am still new to bazel so its good that I am figuring this stuff out,

im not exactly new to it, but still figuring this stuff out 8/

Do you think it would be better to add to the documentation that you can change the the bazelrc files ...

yeah i think this is probably the best we can do - ive just been testing locally - and even with --jobs=200 its plodding along nicely using all my cores (16) - so really not sure how we can set this any better

im defo aware of the issue - i think someone else may have asked about it recently, and im pretty sure ive hit from time to time

unfortunately our dev docs are a bit scattered - theres a couple of unrelated mentions about user.bazelrc - perhaps we could put something explicit/prominent about OOMing and reducing cores/jobs

bazel/README.md:    echo "build --config=clang" >> user.bazelrc
bazel/README.md:If you want to make libc++ as default, add a line `build --config=libc++` to the `user.bazelrc` file in Envoy source root.
bazel/README.md:You may persist those options in `user.bazelrc` in Envoy repo or your `.bazelrc`.
ci/README.md:1. add bazel options like `--remote_cache` and/or `--remote_cache_header` to a `.bazelrc` file, such as `user.bazelrc`. For example

@leonematt
Copy link
Copy Markdown
Contributor Author

leonematt commented Jun 2, 2025

I see. Let me get back to you on this soon. I didn't even know distributed builds are a thing, so this is introducing some complexity I need to figure out.

For your local build machine, how many gigabytes of RAM do you have? It might be the case that your builds generally succeed but sometimes fail because your system can only run 16 processes at a single time but you have enough RAM such that you only run out of memory infrequently.

Could you also do me a favor and run a local build with 20 jobs and tell me the run time using time or perhaps bazel has a timestamp feature? This will allow us to figure out if you are benefitting from having 200 jobs running locally versus running 20 jobs locally.

I ran htop while running my builds to figure out why they were failing, and I could see that all 32 CPUs that I had were saturated but I only have 32 GB of RAM in my machine and I saw that getting filled and then the build failing.

If what I am thinking is correct, I think we should implement a solution so that developers cannot fail to build a freshly-cloned repo.

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 2, 2025

many gigabytes of RAM do you have? It might be the case that your builds generally succeed

32GB and 16 cores - i think bazel can take 3-4G per core (iirc) - there are some large jobs, most are pretty small mem wise but if you gt the memhogs running at the same time then it can OOM

tbf i mostly build using the project rbe as im working on the project

Could you also do me a favor and run a local build with 20 jobs and tell me the run time using time or perhaps

im afraid i dont have the cycles for this, and there are way too many pieces of string in play to be of much use

Envoy is painful to build at first, once you have a cache it gets a lot easier

@leonematt
Copy link
Copy Markdown
Contributor Author

I understand. Do you think it would be a better idea to change this to the dynamic limiter implementation that picks the maximum resources while maintaining guaranteed success on builds? I think that would bring resiliency to the project and now the user would not be confused about local vs remote builds.

@KBaichoo
Copy link
Copy Markdown
Contributor

I think this is waiting on you @phlax

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 10, 2025

im not sure its the right solution for the right problem

rn - it uses all cpus-1 - not sure this would actually change that - it would just prevent it from being applied unless you used the ci/ path - this is kinda the opposite of what we might want to change (not convinced we do want to change anything here tho)

i honestly think the best path forward is to add better info about customizing bazel options with user.bazelrc, particularly wrt to ooming

i just dont think there is a perfect solution here - mostly, given the time it takes to build envoy, but also for devs re/building specific components, you want to use all available resources

i think this can lead to OOM if your cpu count is high relative to mem, or for whatever reason you trigger the most mem hungry builds simultaneously - but i think we have to balance that with the more common case of just wanting to use whatever you have available

@leonematt
Copy link
Copy Markdown
Contributor Author

I agree with the fact we should use whatever resources are available. My proposed solution isn't to run while leaving resources on the table. My proposed solution is to run with the maximal amount of resources while still allowing the build to succeed. Does running with all of the system resources matter if the build fails? That is why I proposed the idea of finding the limiting hardware resource factor, be it CPUs, Memory, Storage, etc, and configure the system to maximize the resources while allowing the build to succeed. Its "Hey, we have a ton of CPUs but not enough memory to build with the number of jobs running, so let's set the CPUs to what the system can handle while still being able to build."

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 11, 2025

...so let's set the CPUs to what the system can handle while still being able to build.

the problem is that nearly all the jobs take very little memory - its really only a few in terms of build that do use a lot - so for the most part you just want to run on all cores - its really only when it tries to build more than one of these more memory hungry jobs that its an issue i think

tldr - i think our defaults are pretty sane, any resolution should happen in .bazelrc (if at all) as then it doesnt rely on using the ci path to be triggered, and in this case the best solution really is to customize either through adding flags on command line, setting the relevant env var, or if you always want to build a particular way then user.bazelrc is the place for that

saying all that, i really do think this could be better documented

@leonematt
Copy link
Copy Markdown
Contributor Author

Alrighty then, I'll submit a documentation update about this issue.

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 11, 2025

thanks @leonematt ! that would be really appreciated

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 11, 2025

i should also mention - we recently landed a PR to optimize foreign_cc builds (ie c/make etc non-bazel builds)

the optimization is to use all cores - which in some cases these might do already - thse are pretty high on candidate list for jobs that are likely to eat memory

so that our ci optimizations dont make things worse in this respect - i added a new flag --//bazel/foreign_cc:parallel_builds which defaults to false - but will assertively use all cores for these jobs if set to true

@leonematt
Copy link
Copy Markdown
Contributor Author

Interesting.

the optimization is to use all cores - which in some cases these might do already - thse are pretty high on candidate list for jobs that are likely to eat memory

Yeah, the local build system does this by default.

so that our ci optimizations dont make things worse in this respect - i added a new flag --//bazel/foreign_cc:parallel_builds which defaults to false - but will assertively use all cores for these jobs if set to true

What does this flag do on false and true, exactly? Does this flag allow build configurations to be tailored to available hardware resources on the nodes?

I almost imagine that it might be more descriptive to use a variable like number_build_nodes to specify that you are using a distributed build system, but the parallel_builds variable also makes sense.

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 16, 2025

Yeah, the local build system does this by default.

im not 100% clear about this (either way) - from my limited testing at least with cmake jobs i had to add the -j flag to get it to use all cores - it may depend on what the cmake recipe itself does - not sure.

likewise - not sure how it handles other types - eg make - i think the default is single core

... Does this flag allow build configurations to be tailored to available hardware resources on the nodes? ... use a variable like number_build_nodes ... ?

atm it uses a bool flag as that is just easier to implement, and my reason for adding was slightly different.

further it has a single flag for all foreign jobs, again for reasons of simplicity

bazel-sklib does have an int_flag so we could probably use that if we wanted to specify the number of cores, potentially with separate flags for any related tasks

@phlax
Copy link
Copy Markdown
Member

phlax commented Jun 16, 2025

thinking through current impl - im remembering why i did it this way - ie 1 or all ...

in the rbe/ci context we want all - similar for local dev that is making cache-blowing changes to one or some foreign builds

in the dont-OOM-my-machine context - eg end user doing a full build - bazel is already taking all cores (other than the one allocated to the foreign build) so you generally just want this set to 1

@botengyao
Copy link
Copy Markdown
Member

/wait-any

Waiting for any actions from @leonematt

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 25, 2025
@leonematt leonematt requested a review from agrawroh as a code owner July 27, 2025 14:52
@leonematt
Copy link
Copy Markdown
Contributor Author

Hi @phlax, my apologies for the delayed response on this PR. I've been swamped with work and appreciate your patience. I had time this morning to finish the documentation update.

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 27, 2025
Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this @leonematt

ive added a few cleanups/clarifications/suggestions

phlax
phlax previously approved these changes Aug 6, 2025
Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for working through this @leonematt

agrawroh
agrawroh previously approved these changes Aug 6, 2025
@leonematt
Copy link
Copy Markdown
Contributor Author

Hi @agrawroh

Thank-you so much! Please don't merge yet because I do want to correct the PR and git commit history. Once I do that I will resolve the comments you made and we should be good to go!

@leonematt leonematt force-pushed the build_dev branch 3 times, most recently from 6f54d70 to 43d7e47 Compare August 8, 2025 00:40
@leonematt
Copy link
Copy Markdown
Contributor Author

Hi @agrawroh and @phlax

It is good to merge now that I updated the PR message and the commit message. Thank-you so much!

@mathetake
Copy link
Copy Markdown
Member

kindly ping @phlax, sounds like ready to go?

@phlax
Copy link
Copy Markdown
Member

phlax commented Aug 12, 2025

afaict feedback still needs to be addressed

/wait-any

This commit updates the documentation to inform new developers how to troubleshoot crashes when building

Signed-off-by: Matthew Leon <matthew.leon.tech@gmail.com>
@leonematt
Copy link
Copy Markdown
Contributor Author

Hi @phlax I totally missed @agrawroh's comments. I have just implemented them. Let me know if anything else needs to change for the documentation, please.

@leonematt leonematt requested review from agrawroh and phlax August 16, 2025 20:37
@ravenblackx
Copy link
Copy Markdown
Contributor

/retest

Copy link
Copy Markdown
Member

@agrawroh agrawroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @phlax to take a final pass.

Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks again for iterating @leonematt

@phlax phlax merged commit bf2c671 into envoyproxy:main Aug 20, 2025
24 checks passed
melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 21, 2025
This pr updates the documentation to inform new developers how to
troubleshoot crashes when building

Signed-off-by: Matthew Leon <matthew.leon.tech@gmail.com>
Signed-off-by: Melissa Ginaldi <mginaldi@google.com>
melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 26, 2025
This pr updates the documentation to inform new developers how to
troubleshoot crashes when building

Signed-off-by: Matthew Leon <matthew.leon.tech@gmail.com>
Signed-off-by: Melissa Ginaldi <mginaldi@google.com>
wtzhang23 pushed a commit to wtzhang23/envoy that referenced this pull request Aug 27, 2025
This pr updates the documentation to inform new developers how to
troubleshoot crashes when building

Signed-off-by: Matthew Leon <matthew.leon.tech@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants