
Add daily tests for integrating with custom images #5184

Merged
Neelabh94 merged 1 commit into GoogleCloudPlatform:develop from spaturi13:develop on Feb 9, 2026

spaturi13 (Contributor) commented Feb 4, 2026

This PR introduces new Slurm cluster blueprints tailored for A3 Ultra and A4 High GPU instances on Google Cloud.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

spaturi13 requested review from a team and samskillman as code owners February 4, 2026 06:59
gemini-code-assist (Contributor) commented

Summary of Changes

Hello @spaturi13, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request adds two Slurm cluster blueprints for daily testing of AI/HPC images, one for A3 Ultra and one for A4 High GPU instances on Google Cloud. The blueprints install GPU-specific software and configure GCS Fuse mounts with local caching. New daily automated tests deploy each blueprint and validate that the resulting cluster works.

Highlights

  • New AI/HPC Slurm Blueprints: Introduced two new Slurm cluster blueprints: one for A3 Ultra GPU instances (a3ultra-aihpc-image-blueprint.yaml) and another for A4 High GPU instances (a4high-aihpc-image-blueprint.yaml), both designed for AI/HPC workloads.
  • GPU Software Integration: The A3 Ultra blueprint now includes automated installation and configuration of NCCL and the Google NCCL-GIB plugin, crucial for high-performance GPU communication. Both blueprints enable NVIDIA DCGM and Persistence Daemon for GPU monitoring and stability.
  • Enhanced Storage Configuration: Both new blueprints configure multiple GCS Fuse mounts for different data types (checkpoints, training data, model serving), leveraging local SSD caching for optimized I/O performance.
  • Automated Daily Testing: Added new Cloud Build configurations (ml-a3-ultragpu-aihpc-blueprint-test.yaml, ml-a4-highgpu-aihpc-blueprint-test.yaml) and corresponding Ansible test variables to establish daily integration tests for these AI/HPC Slurm images, ensuring continuous validation of their deployment and functionality.
  • Slurm Controller Enhancements: The controller startup scripts for both blueprints now include logic to enable GPU health checks via Slurm epilog scripts and password-free sudo for OS Admin Login users within compute nodes.

Changelog

  • tools/cloud-build/daily-tests/blueprints/a3ultra-aihpc-image-blueprint.yaml
    • Added a new blueprint for A3 Ultra Slurm clusters, including network, filestore, and GCS bucket configurations.
    • Integrated startup scripts for compute nodes to install Docker, configure Enroot, install NCCL and Google NCCL-GIB plugin, and enable NVIDIA DCGM and Persistence Daemon.
    • Configured A3 Ultra nodesets, partitions, and Slurm login/controller nodes with specific disk types and sizes.
    • Enabled external prolog/epilog scripts for GPU health checks and password-free sudo.
  • tools/cloud-build/daily-tests/blueprints/a4high-aihpc-image-blueprint.yaml
    • Added a new blueprint for A4 High Slurm clusters, including network, filestore, and GCS bucket configurations.
    • Integrated startup scripts for compute nodes to install Docker, configure Enroot, and enable NVIDIA DCGM and Persistence Daemon (including datacenter-gpu-manager installation).
    • Configured A4 High nodesets, partitions, and Slurm login/controller nodes with specific disk types and sizes.
    • Enabled external prolog/epilog scripts for GPU health checks and password-free sudo.
  • tools/cloud-build/daily-tests/builds/ml-a3-ultragpu-aihpc-blueprint-test.yaml
    • Added a new Cloud Build configuration for daily testing of the A3 Ultra AI/HPC blueprint.
    • Includes steps to check for running builds, determine available zones, modify the blueprint (disable deletion protection for testing), and execute an Ansible playbook for Slurm integration testing.
    • Uses secret variables for AIHPC_IMAGE_PROJECT and AIHPC_IMAGE_FAMILY.
  • tools/cloud-build/daily-tests/builds/ml-a4-highgpu-aihpc-blueprint-test.yaml
    • Added a new Cloud Build configuration for daily testing of the A4 High AI/HPC blueprint.
    • Includes steps to check for running builds, determine available zones, modify the blueprint (disable deletion protection for testing), and execute an Ansible playbook for Slurm integration testing.
    • Uses secret variables for AIHPC_IMAGE_PROJECT and AIHPC_IMAGE_FAMILY.
  • tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-aihpc-blueprint-test.yml
    • Added Ansible test variables for the A3 Ultra AI/HPC blueprint.
    • Defines test_name, deployment_name, slurm_cluster_name, blueprint_yaml path, login/controller node patterns, network, and NCCL test path.
    • Specifies post_deploy_tests including mount validation, partition validation, Enroot, GPU Slurm, and NCCL tests.
    • Defines post_destroy_tasks to delete images.
    • Sets custom_vars like gpu_count, gpu_partition, test_persistenced, partitions, mounts, instance_labels, and enable_spot.
    • Provides cli_deployment_vars for region, zone, cluster size, image project/family, and spot VM enablement.
  • tools/cloud-build/daily-tests/tests/ml-a4-highgpu-aihpc-blueprint-test.yml
    • Added Ansible test variables for the A4 High AI/HPC blueprint.
    • Defines test_name, deployment_name, slurm_cluster_name, blueprint_yaml path, login/controller node patterns, and network.
    • Specifies post_deploy_tests including mount validation, partition validation, Enroot, and GPU Slurm tests.
    • Defines post_destroy_tasks to delete images.
    • Sets custom_vars like gpu_count, gpu_partition, test_persistenced, partitions, mounts, instance_labels, and enable_spot.
    • Provides cli_deployment_vars for region, zone, cluster size, image project/family, and spot VM enablement.
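
The variable groups listed for the two test definition files can be pictured as a nested mapping. The following is a minimal sketch, not the PR's actual Ansible file: the top-level key names and the blueprint path come from the changelog above, while every value, the `REQUIRED_KEYS` tuple, and the `missing_keys` helper are hypothetical placeholders chosen for illustration.

```python
# Hypothetical sketch of the test-variable layout described in the changelog.
# Key names are taken from the summary; all values are illustrative placeholders
# and do not reflect the real contents of the .yml files in the PR.

REQUIRED_KEYS = (
    "test_name",
    "deployment_name",
    "slurm_cluster_name",
    "blueprint_yaml",
    "post_deploy_tests",
    "post_destroy_tasks",
    "custom_vars",
    "cli_deployment_vars",
)


def missing_keys(test_vars):
    """Return the required top-level keys absent from test_vars, in order."""
    return [key for key in REQUIRED_KEYS if key not in test_vars]


example_vars = {
    "test_name": "example-test",              # placeholder
    "deployment_name": "example-deployment",  # placeholder
    "slurm_cluster_name": "examplecl",        # placeholder
    # File path as listed in the changelog; the real variable may be written differently.
    "blueprint_yaml": "tools/cloud-build/daily-tests/blueprints/a4high-aihpc-image-blueprint.yaml",
    "post_deploy_tests": ["mounts", "partitions", "enroot", "gpu-slurm"],  # placeholder labels
    "post_destroy_tasks": ["delete-images"],  # placeholder label
    "custom_vars": {"gpu_count": 8, "enable_spot": True},  # placeholder values
    "cli_deployment_vars": {"region": "example-region", "zone": "example-zone"},
}

if __name__ == "__main__":
    # Prints the list of missing keys for the example mapping.
    print(missing_keys(example_vars))
```

A check like this is only a reading aid for the structure summarized above; the daily tests themselves are driven by Cloud Build and Ansible, as the changelog states.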

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces daily integration tests for Slurm clusters using AI-HPC images on A3 Ultra and A4 High GPU instances. The changes include new blueprints, Cloud Build configurations, and test definitions. The overall structure is good and follows project patterns. I've identified a few areas for improvement, including removing some duplicated configuration in one of the blueprints.

Neelabh94 added the external (PR from external contributor) label Feb 4, 2026
spaturi13 marked this pull request as draft February 5, 2026 10:45
arpit974 changed the title from "Add daily tests for integrating with ai-hpc slurm images" to "Add daily tests for integrating with custom slurm images" Feb 5, 2026
arpit974 changed the title from "Add daily tests for integrating with custom slurm images" to "Add daily tests for integrating with custom images" Feb 5, 2026
spaturi13 force-pushed the develop branch 2 times, most recently from ce126d7 to a9ba251 February 5, 2026 15:22
spaturi13 marked this pull request as ready for review February 5, 2026 15:23
arpit974 previously approved these changes Feb 6, 2026
arpit974 (Contributor) commented Feb 6, 2026

/gcbrun

Neelabh94 added the release-chore (To not include into release notes) label Feb 7, 2026
Neelabh94 (Contributor) commented

/gcbrun

Neelabh94 (Contributor) commented

/gcbrun

Neelabh94 enabled auto-merge (squash) February 9, 2026 14:02
Neelabh94 merged commit 950a91c into GoogleCloudPlatform:develop Feb 9, 2026
11 of 79 checks passed

Labels

  • external (PR from external contributor)
  • release-chore (To not include into release notes)
