
Making separate integration test for nccl test in gke a3 ultra #4622

Merged
shubpal07 merged 2 commits into GoogleCloudPlatform:develop from shubpal07:shubham/gke-a3-ultra-integ-test
Sep 8, 2025

Conversation

@shubpal07

Contributor

This PR refactors the monolithic gke-a3-ultragpu integration test into two distinct, focused tests: a lightweight Kueue validation test and a heavyweight NCCL performance test.
This change addresses a series of severe stability issues discovered during the Kueue-on-Helm migration. The previous single-test structure was brittle and was the root cause of CI failures in which the cluster scaled its own GPU nodes to zero. The new structure is more stable, gives faster feedback, uses fewer resources, and avoids this failure mode.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 self-assigned this Sep 8, 2025
@shubpal07 shubpal07 requested review from a team and samskillman as code owners September 8, 2025 08:21
@shubpal07 shubpal07 added the release-improvements Added to release notes under the "Improvements" heading. label Sep 8, 2025
@shubpal07

Contributor Author

Tested the new build file by creating a new trigger named PR-test-gke-a3-ultragpu-nccl. The test passed successfully.

Comment thread tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-nccl.yaml
@shubpal07 shubpal07 requested review from a team and agrawalkhushi18 September 8, 2025 08:53
@shubpal07 shubpal07 merged commit e0c6a23 into GoogleCloudPlatform:develop Sep 8, 2025
18 of 68 checks passed
