test/distributed/test_c10d.py::RendezvousEnvTest::test_common_errors is failing sometimes #53526

@samestep

Error message (from this job):

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1492, in wrapper
    return func(*args, **kwargs)
  File "distributed/test_c10d.py", line 507, in test_common_errors
    next(gen)
AssertionError: ValueError not raised
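
For readers unfamiliar with the message: `AssertionError: ValueError not raised` is what `unittest`'s `assertRaises` reports when the guarded call completes without raising. The sketch below reproduces that message with a hypothetical stand-in generator (`fake_rendezvous` is not the torch API, and `run_check` is an illustrative helper). It assumes, purely for illustration, that the check depends on an environment variable being absent; the report above does not establish that this is the actual cause of the flakiness.

```python
import os
import unittest
from unittest import mock


def fake_rendezvous():
    """Hypothetical stand-in for an env-based rendezvous generator."""
    # Generator bodies run lazily: this check fires on the first next() call.
    if "MASTER_ADDR" not in os.environ:
        raise ValueError("MASTER_ADDR expected, but not set")
    yield os.environ["MASTER_ADDR"]


class Demo(unittest.TestCase):
    def check_raises(self):
        gen = fake_rendezvous()
        with self.assertRaises(ValueError):
            next(gen)


def run_check(env):
    """Run the check under a temporary environment.

    Returns None if the assertion passes, else the AssertionError text.
    """
    with mock.patch.dict(os.environ, env, clear=True):
        try:
            Demo("check_raises").check_raises()
            return None
        except AssertionError as exc:
            return str(exc)


if __name__ == "__main__":
    print(run_check({}))                            # variable absent
    print(run_check({"MASTER_ADDR": "127.0.0.1"}))  # variable present
```

In this sketch the check passes when the variable is absent and fails with the same message as the traceback when it is present, which is one way an assertion like this could become order-dependent across tests sharing a process.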

According to the HUD, this is the timeline:

  • c0adabe test started failing
  • f595ba1 test stopped failing
  • 8c798e0 test started failing again
  • 1fe6a65 test switched from shard 2 to shard 1

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Labels: oncall: distributed
