CI: rocm (default, 1, 3, linux.rocm.gpu) is very slow #110181

@jon-chuang

Current Status

ongoing

Issue looks like

Taking ~2.5 hours on #110167

Taking 3.5+ hours on #109976

User impact

Slower merging

Root cause

Seems like it may have been introduced in #109817 @malfet

Mitigation

Not sure

Prevention/followups

Investigate the cause of the slow running time, or split the tests up into smaller test jobs. Aim for a running time similar to that of the CUDA tests (~1.5 hours).
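A rough sketch of the "split up tests into smaller test jobs" option, assuming per-test-file durations are available from earlier runs. The function name `split_into_shards` and the example durations are hypothetical, and this is not PyTorch's actual sharding code; it only illustrates a greedy longest-first assignment that keeps the slowest shard as short as possible.

```python
import heapq


def split_into_shards(durations, num_shards):
    """Greedily assign tests to shards, longest test first.

    durations: dict mapping a test name to its run time in seconds
               (hypothetical input, e.g. taken from earlier CI runs).
    Returns (shards, totals): the test names per shard and the
    summed run time of each shard.
    """
    # Min-heap of (accumulated_seconds, shard_index) so the least
    # loaded shard is always popped first.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Longest tests first; ties broken by name for determinism.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))

    totals = [0.0] * num_shards
    for total, idx in heap:
        totals[idx] = total
    return shards, totals


if __name__ == "__main__":
    # Made-up durations, for illustration only.
    example = {
        "test_a": 3600, "test_b": 2400, "test_c": 1800,
        "test_d": 1200, "test_e": 600, "test_f": 600,
    }
    for n in (3, 6):
        _, totals = split_into_shards(example, n)
        print(n, "shards, slowest shard:", max(totals), "seconds")
```

Raising `num_shards` lowers the slowest shard's time only until a single long test dominates, so very long individual test files would still need investigating on their own.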

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @seemethere @malfet @pytorch/pytorch-dev-infra @ZainRizvi @kit1980 @huydhn @clee2000

Metadata

Labels

module: ci - Related to continuous integration
module: devx - Related to PyTorch contribution experience (HUD, pytorchbot)
module: rocm - AMD GPU support for PyTorch
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

Status

Done
