Fix Slurm 25.05 topology.yaml parsing error#5558
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical startup failure in Slurm 25.05 caused by the YAML parser's strict memory-mapped I/O requirements. By ensuring that the generated topology configuration file is never exactly a multiple of the system page size, the change prevents the parser from reading past the buffer and triggering an invalid argument error. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a logic change to the gen_topology_yaml function in conf.py to append a newline character to the YAML file if its size is a multiple of 4096 bytes. The review feedback suggests using pathlib methods for file size checks and operations to maintain consistency with the existing codebase and eliminate the need for the os module import.
|
Test failure - PR-test-ml-a3-highgpu-onspot-slurm is failing due to an ongoing TCPx repository connection issue in startup script and not because of code changes in this PR. |
Description
This PR implements a fix for Slurm 25.05 to prevent a fatal startup failure when reading
topology.yamldue to a page-alignment issue.Problem
In Slurm 25.05, the YAML parser plugin uses memory-mapped I/O to read configuration files like topology.yaml. The parser strictly requires the input buffer to be NULL-terminated. If the file size is exactly a multiple of the page size (4096 bytes), the parser reads past the mapped buffer to check for the NULL terminator, potentially accessing random data in the adjacent page and failing with EINVAL (Invalid argument).
This causes a fatal error:
fatal: Something wrong with reading /usr/local/etc/slurm/topology.yaml: Invalid argument.Solution
Modified
gen_topology_yamlinconf.pyto check the size of the generatedcloud_topology.yamlfile. If the size is a multiple of 4096 bytes, the script appends a newline (\n) to the file. This ensures the file is never page-aligned, satisfying the Slurm parser without breaking the YAML validity.Verification
topology.yamlto exactly 4096 bytes. The Slurm controller failed to start with the expectedInvalid argumenterror.conf.pyto force the generated file size to 4096 bytes. Verified that the fix correctly appended a newline, resulting in a 4097-byte file, andslurmctldstarted successfully.