Skip to content

Address shelve permissions#3951

Merged
mr0re1 merged 3 commits into
GoogleCloudPlatform:developfrom
casassg:gerardc/fix-shelve-permissions
Apr 25, 2025
Merged

Address shelve permissions#3951
mr0re1 merged 3 commits into
GoogleCloudPlatform:developfrom
casassg:gerardc/fix-shelve-permissions

Conversation

@casassg

@casassg casassg commented Apr 15, 2025

Copy link
Copy Markdown
Contributor

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

Fixes #3946

@google-cla

google-cla Bot commented Apr 15, 2025

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@casassg casassg changed the base branch from main to develop April 16, 2025 14:41
@casassg casassg marked this pull request as ready for review April 18, 2025 05:48
@casassg casassg requested review from a team and samskillman as code owners April 18, 2025 05:48
@mr0re1

mr0re1 commented Apr 22, 2025

Copy link
Copy Markdown
Collaborator

Hello @casassg ,
Frankly, the original piece of code looks like terribly over complicated solution for an easy problem.
I'm hesitant to add more to it.

Could you please explain your motivation for this fix, so we could prioritize it (move forward with your fix or invest into better solution) ?

At a glance, this cache is "append only" :

  • records never deleted;
  • records never change (all rewrites are to the same value), due to immutable nature of instance template.

So access to this cache shouldn't involve any synchronizations. One way would be to stop using shelve and have directory where each file is dedicated to one cached instance template.
The content of file is either JSON (preferable) or pickled (makes for easier change).
To avoid "partial writes/reads" the addition of records could be done through (write_tmp + mv)

@casassg

casassg commented Apr 24, 2025

Copy link
Copy Markdown
Contributor Author

@mr0re1 agree w your assessment that this piece of code is more complex than needed, however it breaks allowing for partition updates without node reconstruction.

Issue (I think) is that in the initial run this shelve cache is created as root (the chown_slurm does basically nothing as the file its chowing does not exist). This means that on subsequent runs of the slurm management service which checks for changes in config files, when the update happens it can't write the cache. This leads to cache being essentially only the one from the initial run with partition changes which should be possible to update not updating at all.

to repro:

  • Deploy slurm w partition p1 with nodes a1-[0-1]
  • Change p1 to add new nodes (or something which should be an accepted change). Deploy change to cluster.
  • See controller logs for the above error.

regarding and have directory where each file is dedicated to one cached instance template agree as well. But tried to do the minimal change that fixed the above issue

@mr0re1

mr0re1 commented Apr 25, 2025

Copy link
Copy Markdown
Collaborator

@casassg , I've missed linked issue, thank you for reporting it!
This is serious issue, I'm inclined to take in your solution.

@mr0re1

mr0re1 commented Apr 25, 2025

Copy link
Copy Markdown
Collaborator

/gcbrun

@mr0re1 mr0re1 added the release-bugfix Added to release notes under the "Bug fixes" heading. label Apr 25, 2025
@mr0re1 mr0re1 merged commit 78b215c into GoogleCloudPlatform:develop Apr 25, 2025
@casassg casassg deleted the gerardc/fix-shelve-permissions branch April 28, 2025 18:22
@casassg

casassg commented Apr 28, 2025

Copy link
Copy Markdown
Contributor Author

great! lmk when this gets released so I can pull it from my side. I think this requires a full controller recreation (so that startup scripts are pulled)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-bugfix Added to release notes under the "Bug fixes" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants