Fix Slurm 25.05 topology.yaml parsing error #5558

Merged
AdarshK15 merged 2 commits into GoogleCloudPlatform:develop from AdarshK15:fix/slurm-25-05-page-alignment
May 4, 2026

Conversation

@AdarshK15

Description

This PR implements a fix for Slurm 25.05 to prevent a fatal startup failure when reading topology.yaml due to a page-alignment issue.

Problem

In Slurm 25.05, the YAML parser plugin uses memory-mapped I/O to read configuration files such as topology.yaml. The parser strictly requires the input buffer to be NUL-terminated. When the file size is not a multiple of the page size, the unused tail of the last mapped page is zero-filled, which presumably supplies that terminator. If the file size is exactly a multiple of the page size (4096 bytes), there is no such tail: the parser reads past the mapped buffer to look for the terminator, may access arbitrary data in the adjacent page, and fails with EINVAL (Invalid argument).

This causes a fatal error: fatal: Something wrong with reading /usr/local/etc/slurm/topology.yaml: Invalid argument.
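
The page-alignment condition can be illustrated with a little arithmetic. This is a minimal sketch, assuming a 4096-byte page and the standard mmap behavior that the part of the last mapped page beyond end-of-file is zero-filled. The function name `slack_bytes` is hypothetical and is not part of Slurm or this repository.

```python
PAGE_SIZE = 4096  # assumed page size, as stated in the description


def slack_bytes(file_size: int, page_size: int = PAGE_SIZE) -> int:
    """Return the number of bytes between end-of-file and the end of the last mapped page.

    A result of 0 means the file fills its last page completely, so the
    mapping contains no zero-filled tail that could act as a terminator.
    """
    return (-file_size) % page_size


if __name__ == "__main__":
    for size in (4095, 4096, 4097, 8192):
        print(size, slack_bytes(size))
```

Sizes for which the function returns 0 (4096 and 8192 in the loop above) are the ones the description identifies as triggering the failure.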

Solution

Modified gen_topology_yaml in conf.py to check the size of the generated cloud_topology.yaml file. If the size is a multiple of 4096 bytes, the script appends a newline (\n) to the file. This ensures the file size is never an exact multiple of the page size, which satisfies the Slurm parser without affecting YAML validity.
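
The check described above could look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR: the helper name `append_newline_if_page_aligned` and the hard-coded `PAGE_SIZE` constant are hypothetical, and the real logic lives inside gen_topology_yaml.

```python
import os

PAGE_SIZE = 4096  # page size named in the description (assumed constant)


def append_newline_if_page_aligned(path: str, page_size: int = PAGE_SIZE) -> bool:
    """Append a single newline when the file size is a multiple of page_size.

    Returns True if a byte was appended, False otherwise.
    """
    size = os.path.getsize(path)
    if size % page_size == 0:
        # Binary append mode guarantees exactly one byte is written.
        with open(path, "ab") as f:
            f.write(b"\n")
        return True
    return False
```

A trailing newline is a blank line to a YAML parser, so the padded file describes the same topology as the unpadded one.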

Verification

  1. Reproduction: Verified on a live cluster running Slurm 25.05.3 by manually padding topology.yaml to exactly 4096 bytes. The Slurm controller failed to start with the expected Invalid argument error.
  2. Fix Verification: Modified conf.py to force the generated file size to 4096 bytes. Verified that the fix correctly appended a newline, resulting in a 4097-byte file, and slurmctld started successfully.
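
The reproduction step can be approximated with a small script. The description does not say how the file was padded, so this sketch assumes trailing newlines were used, which keeps the YAML valid whether or not the file already ends with a newline. The helper name `pad_to_size` is hypothetical.

```python
from pathlib import Path

PAGE_SIZE = 4096  # page size named in the description (assumed constant)


def pad_to_size(path: Path, target: int = PAGE_SIZE) -> None:
    """Grow the file to exactly `target` bytes by appending newline characters."""
    missing = target - path.stat().st_size
    if missing < 0:
        raise ValueError(f"{path} is already larger than {target} bytes")
    # Trailing blank lines do not change the meaning of a YAML document.
    with path.open("ab") as f:
        f.write(b"\n" * missing)
```

Running this against a copy of topology.yaml and then restarting slurmctld on an unpatched 25.05 installation should, per the reproduction above, surface the Invalid argument error.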

@gemini-code-assist

Summary of Changes

This pull request addresses a critical startup failure in Slurm 25.05 caused by the YAML parser's strict memory-mapped I/O requirements. By ensuring that the generated topology configuration file is never exactly a multiple of the system page size, the change prevents the parser from reading past the buffer and triggering an invalid argument error.

Highlights

  • Memory-mapped I/O fix: Added a check to ensure the generated topology.yaml file size is not a multiple of the page size (4096 bytes).
  • Padding implementation: Appends a newline character to the configuration file if it is page-aligned to prevent Slurm 25.05 parser errors.

@AdarshK15 AdarshK15 added the release-bugfix Added to release notes under the "Bug fixes" heading. label Apr 27, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a logic change to the gen_topology_yaml function in conf.py to append a newline character to the YAML file if its size is a multiple of 4096 bytes. The review feedback suggests using pathlib methods for file size checks and operations to maintain consistency with the existing codebase and eliminate the need for the os module import.
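
A pathlib-based variant along the lines of this suggestion might look as follows. It is an illustrative sketch only: the helper name `pad_if_page_aligned` is hypothetical, and the final shape of the merged code is not shown in this conversation.

```python
from pathlib import Path


def pad_if_page_aligned(path: Path, page_size: int = 4096) -> bool:
    """Append one newline if the file size is a multiple of page_size.

    Uses only pathlib, so no `os` import is needed.
    Returns True if the file was padded.
    """
    if path.stat().st_size % page_size == 0:
        with path.open("ab") as f:
            f.write(b"\n")
        return True
    return False
```

`Path.stat().st_size` reports the same value as `os.path.getsize`, so the behavior is unchanged and only the import footprint differs.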

@LAVEEN LAVEEN left a comment

LGTM

@AdarshK15

AdarshK15 commented May 4, 2026

Test failure: PR-test-ml-a3-highgpu-onspot-slurm is failing because of an ongoing TCPx repository connection issue in the startup script, not because of the code changes in this PR.
The remaining Slurm tests passed.

@AdarshK15 AdarshK15 marked this pull request as ready for review May 4, 2026 04:43
@AdarshK15 AdarshK15 requested a review from a team as a code owner May 4, 2026 04:43
@AdarshK15 AdarshK15 merged commit b122205 into GoogleCloudPlatform:develop May 4, 2026
47 of 78 checks passed
@AdarshK15 AdarshK15 deleted the fix/slurm-25-05-page-alignment branch June 7, 2026 15:49