Removing MGLRU dependency from Google cloud cluster toolkit #4255

Merged
shubpal07 merged 1 commit into GoogleCloudPlatform:develop from shubpal07:shubpal/bugs/414260682
Jun 9, 2025

@shubpal07

Contributor

Submission Checklist

This change removes the disable-mglru DaemonSet and its associated configurations from our cluster toolkit. This workaround was originally implemented to mitigate a critical bug in specific GKE versions (1.30.0 to 1.30.5 and 1.31.0 to 1.31.1) where the MGLRU-enabled kernel caused incorrect memory accounting by the kubelet, leading to cluster instability and erroneous pod evictions.
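As a rough illustration of the affected ranges named above, the following sketch checks whether a GKE version string falls inside them. The function name `is_affected_gke_version` and the `AFFECTED_PATCHES` table are hypothetical and not part of the toolkit; the bounds are taken only from this description, so the official known-issues page remains the source of truth.

```python
import re

# Affected ranges as stated in this PR description (inclusive patch bounds).
# Keys are (major, minor); values are (lowest_patch, highest_patch).
AFFECTED_PATCHES = {(1, 30): (0, 5), (1, 31): (0, 1)}


def is_affected_gke_version(version: str) -> bool:
    """Return True if the version is in 1.30.0-1.30.5 or 1.31.0-1.31.1.

    Accepts strings such as "1.30.3" or "1.30.3-gke.1000"; anything after
    the major.minor.patch prefix is ignored.
    """
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    bounds = AFFECTED_PATCHES.get((major, minor))
    if bounds is None:
        return False
    low, high = bounds
    return low <= patch <= high
```

A check like this only approximates the condition under which the old workaround mattered; it does not account for patched node images within a minor version.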

Google has since resolved this issue by disabling MGLRU by default in the underlying Container-Optimized OS (COS) for all patched and current GKE versions. As a result, this workaround is obsolete for any actively maintained GKE cluster, and keeping it in the toolkit adds unnecessary complexity.

Verified that all current GKE versions available in the Rapid, Regular, and Stable release channels use node images with MGLRU disabled by default, as per Google's official documentation.
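The verification above relied on Google's documentation. For anyone who wants to spot-check a node directly, here is a minimal sketch that reads the kernel's multi-gen LRU switch from sysfs. It assumes the standard Linux path `/sys/kernel/mm/lru_gen/enabled`, whose value is a hexadecimal bitmask with bit `0x0001` as the main switch; the helper names are hypothetical and this is not how the PR itself was validated.

```python
from pathlib import Path

# Standard sysfs location of the multi-gen LRU switch on Linux kernels
# built with MGLRU support (assumed to be readable from the node).
LRU_GEN_ENABLED = Path("/sys/kernel/mm/lru_gen/enabled")

# Bit 0x0001 of the value is the main on/off switch for MGLRU.
MAIN_SWITCH = 0x0001


def mglru_enabled_from_text(text: str) -> bool:
    """Interpret the file contents, e.g. "0x0007" (on) or "0x0000" (off)."""
    return bool(int(text.strip(), 16) & MAIN_SWITCH)


def mglru_enabled(path: Path = LRU_GEN_ENABLED) -> bool:
    """Return True if MGLRU's main switch is on for this host.

    A missing file is treated as "not enabled", since kernels built
    without MGLRU do not expose it.
    """
    if not path.exists():
        return False
    return mglru_enabled_from_text(path.read_text())
```

On a node, calling `mglru_enabled()` from a shell with access to the host's `/sys` would be expected to return `False` if the image ships with MGLRU disabled.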

Related Issues/References
Kubernetes Bug Report: Kubernetes Issue #127844

GKE Known Issues (Historical): Increased Pod eviction rates on GKE versions 1.30 and 1.31.
https://cloud.google.com/kubernetes-engine/docs/troubleshooting/known-issues#increased-pod-eviction-mglru
https://cloud.google.com/kubernetes-engine/docs/release-notes

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 requested review from a team and samskillman as code owners June 9, 2025 07:31
@shubpal07 shubpal07 added the release-improvements Added to release notes under the "Improvements" heading. label Jun 9, 2025
@ighosh98

ighosh98 commented Jun 9, 2025

Contributor

@shubpal07 Please run the relevant blueprint tests before merging.

@parulbajaj01 parulbajaj01 left a comment

Contributor

LGTM

@shubpal07

Contributor Author

PR tests for the corresponding machine types passed.

@shubpal07 shubpal07 merged commit 05c9356 into GoogleCloudPlatform:develop Jun 9, 2025
19 of 71 checks passed

Labels

release-improvements Added to release notes under the "Improvements" heading.

3 participants