Skip to content

Make Gradle's memory management work appropriately in memory constrained docker containers draft#17628

Closed
Thrillpool wants to merge 3 commits into
gradle:masterfrom
Thrillpool:master
Closed

Make Gradle's memory management work appropriately in memory constrained docker containers draft#17628
Thrillpool wants to merge 3 commits into
gradle:masterfrom
Thrillpool:master

Conversation

@Thrillpool

@Thrillpool Thrillpool commented Jul 6, 2021

Copy link
Copy Markdown
Contributor

Resolves the issue described here #17610, in short, Gradle in a memory constrained docker context doesn't respect its memory limit which can lead to its host killing it, as well as general poor performance.

Presented in this PR is just a rough draft, the CGroupMemoryInfo code is tried and tested and confirmed to work very well in our CI pipeline, where we have in fact overwritten the MemInfoOsMemoryInfo classfile to have the logic of CGroupMemoryInfo.

Some thought will be needed to work out when to use this class, as some context, if the memory has not been constrained then the number that appears in /sys/fs/cgroup/memory/memory.limit_in_bytes is this magic max long of 9223372036854771712.

Context

I know from experience that many users have struggled with issues of CI pipelines having the docker daemon 'disappear', the underlying cause being the CI has killed the agent due to using too much memory. I suspect vew few people have identified the root cause, instead the memory of the agent has been increased.

This issue has been around a long time, the first time I can see someone hitting upon it is

#5102

which was never really resolved.

#7058 (comment) identified that cgroups should be used in some form, but refrained from doing it. The memory handling code linked alluded to in elasticsearch, https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java looks more respectable than my attempt.

If it is determined such a change is not suitable, I would strongly ask that Gradle make this issue more explicit to people. I did not find the above issue when initially determining the problem, instead spending quite a lot of time reading the Gradle code and reasoning it myself. Documentation concerning problems in memory constrained docker containers would be very helpful.

Contributor Checklist

  • Review Contribution Guidelines
  • Make sure that all commits are signed off to indicate that you agree to the terms of Developer Certificate of Origin.
  • Make sure all contributed code can be distributed under the terms of the Apache License 2.0, e.g. the code was written by yourself or the original code is licensed under a license compatible to Apache License 2.0.
  • Check "Allow edit from maintainers" option in pull request so that additional changes can be pushed by Gradle team
  • Provide integration tests (under <subproject>/src/integTest) to verify changes from a user perspective
  • Provide unit tests (under <subproject>/src/test) to verify logic
  • Update User Guide, DSL Reference, and Javadoc for public-facing changes
  • Ensure that tests pass sanity check: ./gradlew sanityCheck
  • Ensure that tests pass locally: ./gradlew <changed-subproject>:quickTest

Gradle Core Team Checklist

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation
  • Recognize contributor in release notes

@ghale ghale left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Thrillpool! The general approach here seems sound and something we should incorporate. I agree that the detection mechanism feels a little bit creaky - let me know what you think about the approach I suggested.

@Thrillpool Thrillpool force-pushed the master branch 2 times, most recently from f0ddd5a to d5d06e2 Compare September 17, 2021 20:50
@stale

stale Bot commented Nov 16, 2021

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had recent activity. Given the limited bandwidth of the team, it will be closed if no further activity occurs. If you intend to work on this pull request, please ask the team to reopen the PR or push a new PR. Thank you for your contributions.

@stale stale Bot added the stale label Nov 16, 2021
@stale

stale Bot commented Dec 7, 2021

Copy link
Copy Markdown

This pull request has been automatically closed due to inactivity. If you are still interested in contributing this, please ensure that it is rebased against the latest branch (usually master), all review comments have been addressed and the build is passing.

@stale stale Bot closed this Dec 7, 2021
@jasondamour

Copy link
Copy Markdown

@Thrillpool @pirateskipper Ping on this. It would be really helpful

@edmundmok

Copy link
Copy Markdown
Contributor

Thanks for this writeup @Thrillpool! We just encountered the same issue and added this patch ourselves.
It would be great if this PR can get merged or at least a section in the docs about this potential issue liked @Thrillpool suggested.

@ghale

ghale commented Aug 30, 2022

Copy link
Copy Markdown
Member

Reopening as this is still an issue and worth fixing.

@ghale ghale reopened this Aug 30, 2022
@ghale

ghale commented Aug 30, 2022

Copy link
Copy Markdown
Member

@Thrillpool I'd like to resurrect this PR. There was a DCO issue with the initial commits which might have been part of the reason why it languished. Can you resolve this, and if at all possible, can you see about getting it rebased onto current master? If rebasing is problematic, I can try to take a look at it when I get a chance.

@jbartok jbartok removed the stale label Sep 7, 2022
@hythloda hythloda added the from:contributor PR by an external contributor label Oct 26, 2022
@Thrillpool

Thrillpool commented Dec 25, 2022

Copy link
Copy Markdown
Contributor Author

Hi @ghale sorry it took me so long. I will do this promptly, hope there is still an interest.

I will be checking this every day for the next couple of weeks, hope we can get it resolved in that time. In a sense I would actually be quite excited to contribute to Gradle! Having now used it a lot, I think it's really an excellent piece of software, and actually that I was able to identify the issue myself in the source code I think is a testament to how well written it is...

@bantonsson bantonsson left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been using a similar change as this one, together with a smaller change (mentioned in #22977), running for months on both 7.5 and 7.6 with excellent results.

Before (gradle daemon crashed):
Screen Shot 2022-12-07 at 14 12 19
After (everything works fine):
Screen Shot 2022-12-07 at 14 17 26

@Thrillpool

Thrillpool commented Dec 29, 2022

Copy link
Copy Markdown
Contributor Author

@ghale For some testing done

I made this repo

https://github.com/Thrillpool/gradle-to-break-docker

When I do

docker run -d -m=1g eclipse-temurin:11 sleep infinity

Get the above repo in and try to execute ./gradlew compileScala --max-workers=1 inside the container, the process fails with 137 and if you run with --debug one can see the workers not getting cleaned up to free memory.

Conversely, if I update gradle-process-services jar to contain my changes, then if I make another instance of eclipse-temurin with the same memory constraint and use this modified gradle, the same command will not fail. Further, now with --debug one can see the workers getting cleaned up 'worker daemon(s) expired to free some memory'.

Finally, I also built the same repo on my host (linux) machine with my patched version of Gradle and it went through OK, and actually I added some additional logging to validate the right memory manager was chosen.

For some insight on what the repo I made to demo the bug is, it's basically just I artifically make the compilation use 500m, and I have a series of subprojects each with different jvm args such that workers can't be reused, and then I have max workers 1, so in theory app1, app2, app3, app4 run in order and workers get cleaned up as needed, but if they are not cleaned up, they occupy 2000m and so will get 137d by our 1g container.

Signed-off-by: chinrank <thrillpool1@gmail.com>
@ghale ghale added this to the 8.1 RC1 milestone Dec 29, 2022
@ghale ghale added the a:bug This doesn't work as expected label Dec 29, 2022
Signed-off-by: chinrank <thrillpool1@gmail.com>
@bantonsson

Copy link
Copy Markdown

Hey @ghale @jbartok it would be great if someone could give this a look, since we need to patch gradle for every new release for our build to work on CircleCI

@ghale

ghale commented Jan 24, 2023

Copy link
Copy Markdown
Member

Thanks @Thrillpool, @bantonsson. This is in my queue, just haven't been able to shift my focus to it yet. It hasn't been forgotten - I'll look at it soon.

@CRogers

CRogers commented Feb 15, 2023

Copy link
Copy Markdown

This has been a long running issue for us that has got dramatically worse recently as we've upped the number of parallel tasks we run. We use circleci and gradle keeps spawning workers until all memory is exhausted due to using /proc/meminfo that does not understand cgroups. There's been a large uptick in oomkills across the 1000s of repos using gradle we manage, so we're also looking at either patching the distributions or changing this in plugin at runtime via reflection.

bot-gradle added a commit that referenced this pull request Feb 21, 2023
…ed Linux docker containers

Supersedes #17628

Co-authored-by: Gary Hale <gary@gradle.com>
@ghale

ghale commented Feb 21, 2023

Copy link
Copy Markdown
Member

This has been merged as part of #23944.

@ghale ghale closed this Feb 21, 2023
@ov7a ov7a removed this from the 8.1 RC1 milestone Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:bug This doesn't work as expected from:contributor PR by an external contributor in:daemon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Gradle's memory management work appropriately in memory constrained docker containers

10 participants