Removing MGLRU dependency from Google Cloud Cluster Toolkit #4255
Merged
shubpal07 merged 1 commit on Jun 9, 2025
Contributor
@shubpal07 Please run the relevant blueprint tests before merging.
ighosh98 approved these changes on Jun 9, 2025
Contributor, Author
PR tests for the corresponding machine types passed.
This change removes the disable-mglru DaemonSet and its associated configurations from our cluster toolkit. This workaround was originally implemented to mitigate a critical bug in specific GKE versions (1.30.0 to 1.30.5 and 1.31.0 to 1.31.1) where the MGLRU-enabled kernel caused incorrect memory accounting by the kubelet, leading to cluster instability and erroneous pod evictions.

Google has since resolved this issue by disabling MGLRU by default in the underlying Container-Optimized OS (COS) for all patched and current GKE versions. As a result, this workaround is now obsolete for any actively maintained GKE cluster, and its continued presence in our toolkit adds unnecessary complexity.
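For readers who still operate older clusters, the affected ranges quoted above can be expressed as a small check. This is only an illustrative sketch and not part of the toolkit: the function names are hypothetical, and it assumes that comparing the major.minor.patch prefix is enough, ignoring the `-gke.N` build suffix that identifies the exact patched builds.

```python
import re

# Affected ranges as stated in this PR description: for each (major, minor)
# line, the inclusive range of patch numbers that needed the workaround.
AFFECTED_RANGES = {
    (1, 30): (0, 5),
    (1, 31): (0, 1),
}


def parse_gke_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '1.30.5-gke.1014001'."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def needed_mglru_workaround(version: str) -> bool:
    """Return True if the version falls inside the ranges quoted in the PR.

    Assumption: the '-gke.N' build suffix is ignored, so this is coarser
    than the official known-issues page.
    """
    major, minor, patch = parse_gke_version(version)
    bounds = AFFECTED_RANGES.get((major, minor))
    if bounds is None:
        return False
    low, high = bounds
    return low <= patch <= high
```

For example, `needed_mglru_workaround("1.30.3-gke.100")` falls inside the first range, while any 1.30.6 or 1.31.2 build falls outside both.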
Verified that all current GKE versions available in the Rapid, Regular, and Stable release channels use node images with MGLRU disabled by default, as per Google's official documentation.
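The verification above relies on Google's documentation. As a complementary spot check, the MGLRU state can also be read directly on a node from the kernel's `/sys/kernel/mm/lru_gen/enabled` file, which reports a hexadecimal bitmask (for example `0x0007` when enabled, `0x0000` when disabled). The sketch below assumes it runs somewhere with access to the node's sysfs; the helper names are hypothetical and this is not how the PR author verified the change.

```python
from pathlib import Path
from typing import Optional

# Kernel sysfs file that reports the MGLRU feature bitmask.
LRU_GEN_ENABLED = Path("/sys/kernel/mm/lru_gen/enabled")


def mglru_enabled_from_text(text: str) -> bool:
    """Interpret the file contents: any non-zero bitmask means MGLRU is on."""
    return int(text.strip(), 16) != 0


def mglru_enabled(path: Path = LRU_GEN_ENABLED) -> Optional[bool]:
    """Return True/False for the MGLRU state, or None if the file is absent.

    A missing file means the kernel was built without MGLRU support
    (or sysfs is not visible from where this runs).
    """
    try:
        return mglru_enabled_from_text(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    state = mglru_enabled()
    if state is None:
        print("lru_gen not present on this kernel")
    else:
        print("MGLRU enabled" if state else "MGLRU disabled")
```

On a node image with MGLRU disabled by default, this would be expected to print "MGLRU disabled".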
Related Issues/References
Kubernetes Bug Report: Kubernetes Issue #127844
GKE Known Issues (Historical): Increased Pod eviction rates on GKE versions 1.30 and 1.31.
https://cloud.google.com/kubernetes-engine/docs/troubleshooting/known-issues#increased-pod-eviction-mglru
https://cloud.google.com/kubernetes-engine/docs/release-notes