Set defaults for GPU driver, disk type and Jobset version for A3U blueprints #3679

Merged
annuay-google merged 1 commit into GoogleCloudPlatform:develop from annuay-google:annuay/set-disk-and-gpu-driver-to-defaults on Feb 17, 2025

@annuay-google

Contributor
  • Remove unnecessary parameters from the A3U blueprints
  • Jobset 0.7.2 is the recommended default version
  • Setting the GPU driver version to "LATEST" is risky and no longer required

Validated with a Jobset-based NCCL test:

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16     float    none      -1    44.55    0.02    0.02      0    43.69    0.02    0.02      0
        2048            32     float    none      -1    43.60    0.05    0.04      0    43.54    0.05    0.04      0
        4096            64     float    none      -1    43.86    0.09    0.09      0    43.76    0.09    0.09      0
        8192           128     float    none      -1    44.22    0.19    0.17      0    44.29    0.18    0.17      0
       16384           256     float    none      -1    44.96    0.36    0.34      0    44.86    0.37    0.34      0
       32768           512     float    none      -1    47.17    0.69    0.65      0    47.58    0.69    0.65      0
       65536          1024     float    none      -1    49.89    1.31    1.23      0    49.47    1.32    1.24      0
      131072          2048     float    none      -1    47.39    2.77    2.59      0    49.87    2.63    2.46      0
      262144          4096     float    none      -1    53.24    4.92    4.62      0    49.99    5.24    4.92      0
      524288          8192     float    none      -1    55.70    9.41    8.82      0    53.36    9.82    9.21      0
     1048576         16384     float    none      -1    74.28   14.12   13.24      0    71.12   14.74   13.82      0
     2097152         32768     float    none      -1    73.54   28.52   26.74      0    74.47   28.16   26.40      0
     4194304         65536     float    none      -1    78.53   53.41   50.07      0    82.40   50.90   47.72      0
     8388608        131072     float    none      -1    94.38   88.88   83.32      0    90.78   92.41   86.63      0
    16777216        262144     float    none      -1    122.8  136.61  128.07      0    121.4  138.14  129.51      0
    33554432        524288     float    none      -1    182.6  183.80  172.31      0    178.4  188.05  176.30      0
    67108864       1048576     float    none      -1    282.1  237.92  223.05      0    277.4  241.91  226.79      0
   134217728       2097152     float    none      -1    499.6  268.63  251.84      0    492.4  272.57  255.54      0
   268435456       4194304     float    none      -1    857.1  313.18  293.60      0    852.9  314.73  295.06      0
   536870912       8388608     float    none      -1   1553.1  345.69  324.08      0   1546.9  347.07  325.38      0
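
The algbw and busbw columns above can be cross-checked against the size and time columns. The sketch below assumes the usual nccl-tests conventions: algorithm bandwidth is message size divided by elapsed time, and for all-gather-style collectives bus bandwidth is algbw scaled by (n-1)/n. The rank count n = 16 is an inference from the busbw/algbw ratio in the table (about 0.9375), not something this PR states, and the function names are illustrative only.

```python
# Sketch: recompute algbw/busbw for a few out-of-place rows of the table above.
# Assumption: busbw = algbw * (n - 1) / n with n = 16 ranks (inferred, not stated in the PR).

ASSUMED_RANKS = 16  # hypothetical value inferred from busbw / algbw ~= 15/16


def alg_bandwidth_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed seconds."""
    return size_bytes / (time_us * 1e-6) / 1e9


def bus_bandwidth_gbps(algbw_gbps: float, n_ranks: int = ASSUMED_RANKS) -> float:
    """Bus bandwidth in GB/s under the assumed (n - 1) / n scaling factor."""
    return algbw_gbps * (n_ranks - 1) / n_ranks


# (size in bytes, time in us, reported algbw, reported busbw), copied from the table.
ROWS = [
    (8388608, 94.38, 88.88, 83.32),
    (134217728, 499.6, 268.63, 251.84),
    (536870912, 1553.1, 345.69, 324.08),
]

if __name__ == "__main__":
    for size, time_us, reported_algbw, reported_busbw in ROWS:
        algbw = alg_bandwidth_gbps(size, time_us)
        busbw = bus_bandwidth_gbps(algbw)
        print(
            f"{size:>10} B: algbw {algbw:7.2f} (reported {reported_algbw}), "
            f"busbw {busbw:7.2f} (reported {reported_busbw})"
        )
```

Small differences from the reported values are expected, because the time column is rounded in the test output.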

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@annuay-google annuay-google changed the base branch from main to develop February 17, 2025 17:29
Comment thread examples/hypercompute_clusters/a3u-gke-gcs/a3u-gke-gcs.yaml
@annuay-google annuay-google force-pushed the annuay/set-disk-and-gpu-driver-to-defaults branch from 1fcb12e to 3162e7f February 17, 2025 17:36
@annuay-google annuay-google added the release-improvements label (Added to release notes under the "Improvements" heading.) Feb 17, 2025

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

Contributor

LGTM

@annuay-google annuay-google merged commit 1acb1e7 into GoogleCloudPlatform:develop Feb 17, 2025