
Update _linux-test to support B200 runner #157341

Closed
huydhn wants to merge 16 commits into main from support-b200-runners


Contributor

@huydhn huydhn commented Jul 1, 2025

This unblocks pytorch/test-infra#6869. The key changes to call out:

  • B200 runners need OIDC to access ECR and to upload stats to S3, so _linux-test must set id-token: write. Every workflow that calls _linux-test must be updated to grant the same permission.
  • Connecting sccache to S3 on B200 doesn't seem to work, so I have disabled it there. It still works locally, though.
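
For anyone debugging a caller workflow, here is a minimal shell sketch of the two points above. It is not taken from the PR: the helper names are hypothetical, and unsetting the bucket variables is only an assumption about one way to keep sccache off S3, not necessarily how this PR does it. The OIDC check relies on GitHub Actions exporting ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN only to jobs that were granted id-token: write.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of this PR.

# Succeeds when the job appears to have been granted `id-token: write`.
# GitHub Actions exports these two variables only in that case.
has_oidc_token_env() {
  [ -n "${ACTIONS_ID_TOKEN_REQUEST_URL:-}" ] &&
    [ -n "${ACTIONS_ID_TOKEN_REQUEST_TOKEN:-}" ]
}

# Assumption: removing the S3 bucket settings from the environment keeps
# sccache from selecting the S3 backend. The PR may disable it differently.
disable_sccache_s3() {
  unset SCCACHE_BUCKET SCCACHE_REGION
}
```

A test step could call has_oidc_token_env early and fail with a clear message, which is probably easier to diagnose than a later ECR login or S3 upload error when a caller forgot to pass the permission through.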

Testing

https://github.com/pytorch/pytorch/actions/runs/16055549292/job/45312298376

Signed-off-by: Huy Do <huydhn@gmail.com>

pytorch-bot bot commented Jul 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157341

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 45d6a75 with merge base 6dc2b22 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

huydhn added 2 commits July 1, 2025 14:55
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Collaborator

nWEIdia commented Jul 2, 2025

Is it time to open up the Docker permissions on the runner to fix this?

Contributor Author

huydhn commented Jul 2, 2025

Is it time to open up the Docker permissions on the runner to fix this?

Yes, we need to make sure that the Docker daemon is running on the runner, and that the user that runs the GitHub daemon is in the docker group so that it has the necessary permissions. Could you help set that up, please?
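
As a rough illustration of that request, the two conditions could be checked on the runner host with something like the sketch below. The function names and the "runner" account name are placeholders, not details from this thread. The commented-out fixes use the standard usermod and systemctl commands and assume a systemd-based host.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the "runner" user are hypothetical.

# user_in_group USER GROUP: succeeds when USER belongs to GROUP.
user_in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -Fqx -- "$2"
}

# docker_daemon_up: succeeds when the Docker CLI can reach the daemon.
docker_daemon_up() {
  docker info >/dev/null 2>&1
}

# Possible fixes, left commented out because they need root:
# user_in_group runner docker || sudo usermod -aG docker runner
# docker_daemon_up || sudo systemctl enable --now docker
```

A new group membership only applies to processes started afterwards, so the runner service would likely need a restart after usermod.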

From what I see in https://github.com/pytorch/pytorch/actions/runs/16011916708/job/45173325662?pr=157341#step:9:67, logging in to ECR at least works now, and Docker can see the image.

Contributor Author

huydhn commented Jul 2, 2025

@nWEIdia If possible, could you also add awscli to the runner to address this issue: https://github.com/pytorch/pytorch/actions/runs/16011916708/job/45173325662?pr=157341#step:9:52? Usually, running pip install awscli is enough.
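
A small sketch of how the runner setup could guard against the missing aws command before a job needs it. The have_cmd helper is hypothetical, and the install line simply mirrors the pip install awscli suggestion above.

```shell
#!/usr/bin/env bash
# Sketch only: have_cmd is a hypothetical helper, not part of this PR.

# have_cmd NAME: succeeds when NAME resolves to a command on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Install the AWS CLI only when it is missing (left commented out):
# have_cmd aws || python3 -m pip install awscli
```

The aws command has to be on the PATH of the user that runs the GitHub daemon, so the check is best run as that user.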

Another question: Does the GitHub daemon user have sudo?

Collaborator

nWEIdia commented Jul 2, 2025

Both of the above steps are done:

  1. The docker command now runs without hitting the daemon issue.
  2. The aws command can now be found.

I am restarting your job.

huydhn added 10 commits July 1, 2025 22:05
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn huydhn requested a review from ZainRizvi July 3, 2025 18:10
@huydhn huydhn marked this pull request as ready for review July 3, 2025 18:10
@huydhn huydhn requested a review from a team as a code owner July 3, 2025 18:10
@huydhn huydhn requested a review from nWEIdia July 3, 2025 18:11
Collaborator

@nWEIdia nWEIdia left a comment


@huydhn huydhn requested a review from seemethere July 4, 2025 00:46
Contributor Author

huydhn commented Jul 4, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jul 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / build

Details for Dev Infra team Raised by workflow job

huydhn added 3 commits July 4, 2025 12:26
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Contributor Author

huydhn commented Jul 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot pushed a commit that referenced this pull request Jul 7, 2025
When PR #157341 landed, it broke the XPU CI test on sccache, which had been disabled by #143851. Re-disable it.
Pull Request resolved: #157693
Approved by: https://github.com/atalman, https://github.com/huydhn
pytorchmergebot pushed a commit that referenced this pull request Jul 9, 2025
This was broken by #157341

This should resolve the permission issue
Pull Request resolved: #157826
Approved by: https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/huydhn
@github-actions github-actions bot deleted the support-b200-runners branch August 4, 2025 02:21

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
test-config/default
topic: not user facing (topic category)


5 participants