Skip to content

NCCL integration tests#3697

Merged
annuay-google merged 1 commit into
GoogleCloudPlatform:developfrom
annuay-google:annuay/nccl-integration-tests
Mar 3, 2025
Merged

NCCL integration tests#3697
annuay-google merged 1 commit into
GoogleCloudPlatform:developfrom
annuay-google:annuay/nccl-integration-tests

Conversation

@annuay-google

@annuay-google annuay-google commented Feb 21, 2025

Copy link
Copy Markdown
Contributor

What?

NCCL integration tests for A+ machine families

Why?

  1. NCCL tests help validate functionality and baseline performance of the cluster
  2. Regressions have crept in due to our current integration tests only validating provisioning and nvidia-smi. An example was the change to rename kubernetes network objects was not caught for breaking the NCCL tests we ship to customers

Testing

Validated by running the tests via cloud build:
A3 Ultra: https://pantheon.corp.google.com/cloud-build/builds;region=global/234a1ddf-b194-4129-a640-0b5b081373a1?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev
A3 Mega: https://pantheon.corp.google.com/cloud-build/triggers;region=global/edit/e2eb58fd-613e-4f32-b107-debed1cd573b?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev
A3 High: https://pantheon.corp.google.com/cloud-build/builds;region=global/6ee8fc8f-f718-4446-b11b-01bfa8cc647d?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

Comment thread tools/cloud-build/daily-tests/builds/gke-a2-highgpu-kueue.yaml
@annuay-google annuay-google marked this pull request as ready for review February 26, 2025 15:55
@annuay-google annuay-google added the release-improvements Added to release notes under the "Improvements" heading. label Feb 26, 2025

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@annuay-google annuay-google changed the base branch from develop to main March 3, 2025 18:15
@annuay-google annuay-google dismissed SwarnaBharathiMantena’s stale review March 3, 2025 18:15

The base branch was changed.

@annuay-google annuay-google changed the base branch from main to develop March 3, 2025 18:15

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@annuay-google annuay-google merged commit 39c921d into GoogleCloudPlatform:develop Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-improvements Added to release notes under the "Improvements" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants