Adding "datacenter-gpu-manager-4-dev" as an additional installation in A* YAML files. #4623

Merged
Neelabh94 merged 2 commits into GoogleCloudPlatform:develop from Neelabh94:feature/dcgm-dev-package
Sep 24, 2025

Conversation

Neelabh94 (Contributor) commented Sep 8, 2025

This change updates all A* Slurm/VM blueprints that install DCGM v4 so that they also install the datacenter-gpu-manager-4-dev package. The ops-agent needs this package to function properly.
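As a rough illustration of the kind of change described above, the fragment below sketches a startup-script runner that installs the dev package alongside DCGM v4. It is a minimal sketch and not copied from the PR diff: the runner file name, the task names, the use of `apt`, and the companion package name `datacenter-gpu-manager-4-cuda12` are all assumptions. Only `datacenter-gpu-manager-4-dev` comes from this PR.

```yaml
# Hypothetical runner for a startup-script module; illustrative only.
- type: ansible-local
  destination: install_dcgm.yml  # hypothetical file name
  content: |
    ---
    - name: Install DCGM v4 with development files
      hosts: all
      become: true
      tasks:
      - name: Install DCGM packages
        ansible.builtin.apt:
          name:
          - datacenter-gpu-manager-4-cuda12  # assumed existing DCGM v4 package
          - datacenter-gpu-manager-4-dev     # package added by this PR
          state: present
          update_cache: true
```

Listing both packages in one task keeps the DCGM install in a single place, so a blueprint that installs DCGM v4 cannot pick up one package without the other.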

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

Neelabh94 added the bug label ("Something isn't working") Sep 8, 2025

samskillman (Collaborator) left a comment

Minor change on the a3mega-slurm-gcsfuse-lssd blueprint. Otherwise LGTM. Please make sure to trigger PR tests.

Comment thread examples/machine-learning/a3-megagpu-8g/a3mega-slurm-gcsfuse-lssd-blueprint.yaml Outdated
Neelabh94 marked this pull request as ready for review September 9, 2025 03:26
Neelabh94 requested a review from a team as a code owner September 9, 2025 03:26
Neelabh94 self-assigned this Sep 9, 2025
Neelabh94 requested a review from bytetwin September 9, 2025 03:29
Neelabh94 added the release-bugfix label (Added to release notes under the "Bug fixes" heading) Sep 9, 2025
LAVEEN previously approved these changes Sep 19, 2025
Neelabh94 force-pushed the feature/dcgm-dev-package branch from 96fec98 to 44b0221 September 19, 2025 16:27
samskillman previously approved these changes Sep 22, 2025

samskillman (Collaborator) left a comment

Suggest removing extra newlines, otherwise LGTM.

Comment thread examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml Outdated
Neelabh94 merged commit 13270d4 into GoogleCloudPlatform:develop Sep 24, 2025
12 of 63 checks passed
Neelabh94 deleted the feature/dcgm-dev-package branch September 24, 2025 04:40
Labels

bug (Something isn't working), release-bugfix (Added to release notes under the "Bug fixes" heading)

3 participants