Hotfix: update G4 NVIDIA drivers for kernel 6.17 compatibility #5289
Summary of Changes (from Gemini Code Assist)
This pull request updates the NVIDIA driver and CUDA toolkit versions.
Code Review
This pull request updates the NVIDIA drivers in the ml-slurm-g4.yaml example to version 590, likely to ensure compatibility with newer kernels. The changes include updating the list of nvidia_packages, adding a task to remove a conflicting firmware package, and using force-overwrite during the driver installation. While the updates seem reasonable, I have a concern regarding the use of force-overwrite. It's a powerful option that can reduce system robustness and maintainability. I've added a comment suggesting to either explore alternatives or to improve the in-code documentation to better justify its use, which would align with the project's goals for technical excellence and maintainability.
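The three changes described in the review can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: it assumes the blueprint installs drivers through Ansible's `ansible.builtin.apt` module, and every package name (and the `conflicting_firmware_package` variable) is a hypothetical placeholder. Only the `nvidia_packages` variable name and the 590 driver series come from the review text.

```yaml
# Sketch only. Package names are placeholders, not the values used in the PR.
- name: Install NVIDIA 590-series drivers (illustrative)
  hosts: all
  become: true
  vars:
    nvidia_packages:
      - example-nvidia-driver-590        # placeholder package name
      - example-cuda-toolkit             # placeholder package name
    conflicting_firmware_package: example-nvidia-firmware  # placeholder
  tasks:
    - name: Remove the firmware package that conflicts with the new driver
      ansible.builtin.apt:
        name: "{{ conflicting_firmware_package }}"
        state: absent
        purge: true

    - name: Install the driver packages
      ansible.builtin.apt:
        name: "{{ nvidia_packages }}"
        state: present
        update_cache: true
        # force-overwrite lets dpkg replace files owned by another package.
        # It is appended to the module's default options so that existing
        # config files are still handled conservatively.
        dpkg_options: force-confdef,force-confold,force-overwrite
```

Removing the conflicting package first narrows what `force-overwrite` has to paper over, which is one way to address the reviewer's concern; a comment next to `dpkg_options` naming the exact file conflict would cover the documentation request.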
The PR-test-ml-g4-onspot-slurm PR test succeeded: https://pantheon.corp.google.com/cloud-build/builds;region=global/018f6b41-af5d-47bd-9ad7-b1a4301437cd?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev (This test includes the same changes, although on a different commit.)
Merged commit c8fa496 into GoogleCloudPlatform:release-candidate
Hotfix: Update G4 NVIDIA drivers for kernel 6.17 compatibility
Callouts:
The issue being fixed here can recur if the image and package versions are not pinned.
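For the package side of that callout, one possible way to pin versions is sketched below. It assumes Ansible-managed apt packages; the package name and version string are hypothetical placeholders, and image pinning is not shown.

```yaml
# Sketch only. Name and version are placeholders, not values from the PR.
- name: Pin NVIDIA driver packages (illustrative)
  hosts: all
  become: true
  vars:
    pinned_packages:
      - name: example-nvidia-driver-590   # placeholder package name
        version: "0.0.0-placeholder"      # placeholder version string
  tasks:
    - name: Install the exact package version
      ansible.builtin.apt:
        name: "{{ item.name }}={{ item.version }}"
        state: present
      loop: "{{ pinned_packages }}"

    - name: Hold the package so routine upgrades do not move it
      ansible.builtin.dpkg_selections:
        name: "{{ item.name }}"
        selection: hold
      loop: "{{ pinned_packages }}"
```

Installing with `name=version` fixes the version at build time, and the `hold` selection keeps later `apt upgrade` runs from replacing it.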