🐛 Bug
trainer.test() does not work with TPUs. There are a few different ways we've seen it crash.
1. Looks like a call to barrier() coming from __test_using_best_weights
So the barrier is coming from here. It is strange that barrier() is being called at all - I think it means that if not self._device_type == DeviceType.TPU is mistakenly evaluating to True. PyTorch Lightning spins up 8 processes for the 8 TPU cores; is it possible that only some of them evaluate it to True?
Basically, it seems that at least one process never reaches this point, so the other processes sit waiting in the barrier, the meetup never happens, and we get the RuntimeError shown.
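To illustrate the suspected failure mode, here is a simplified sketch (this is not Lightning's actual code; the DeviceType stand-in and the rendezvous tag are made up for illustration, and I'm assuming barrier() maps to an xm.rendezvous() on TPU):

import torch_xla.core.xla_model as xm
from enum import Enum

class DeviceType(Enum):  # stand-in for Lightning's DeviceType enum
    TPU = "tpu"
    GPU = "gpu"

def maybe_barrier(device_type):
    # On TPU this condition should be False in every spawned process, so the
    # barrier should never be reached. If it mistakenly evaluates to True in
    # only some of the 8 processes, those processes enter the rendezvous below
    # and wait for the others, which never arrive -> "Failed to meet rendezvous".
    if not device_type == DeviceType.TPU:
        xm.rendezvous("illustrative_test_barrier")  # hypothetical tag, not the real one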
2. Looks like a call to xm.save() is being misused:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 103, in new_process
    self.transfer_distrib_spawn_state_on_fit_end(results)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 129, in transfer_distrib_spawn_state_on_fit_end
    xm.save(self.lightning_module.state_dict(), last_path)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 817, in save
    rendezvous('torch_xla.core.xla_model.save')
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.save': Socket closed (14)
Exception in device=TPU:6: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.save': Socket closed (14)
Exception in device=TPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.save': Socket closed (14)
I think the problem is here, with the usage of xm.save().
xm.save() already handles the multiprocess case: it checks the ordinal and only writes to disk when the process is on the master ordinal. In general, if you wrap xm.save() in an if statement, some TPU cores enter the branch and some do not, so the cores that entered wait for the ones that didn't until the rendezvous times out and crashes.
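To make this concrete, here is a minimal sketch (not Lightning's code; the model and the checkpoint path are just placeholders):

import torch
import torch_xla.core.xla_model as xm

model = torch.nn.Linear(4, 4)  # placeholder model

# Anti-pattern: only the master ordinal reaches xm.save(). Since xm.save()
# performs a rendezvous internally, the non-master cores never join it and the
# call eventually fails with "Failed to meet rendezvous ... Socket closed".
if xm.is_master_ordinal():
    xm.save(model.state_dict(), "last.ckpt")

# Intended usage: every core calls xm.save(); it checks the ordinal internally
# and only the master ordinal actually writes the file to disk.
xm.save(model.state_dict(), "last.ckpt")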
Repro methods
1. (Colab) Make 3 modifications to the BoringModel - among them, add tpu_cores=8 to the trainer cell (a rough sketch of the resulting cell is shown after this list)
2. (Google Cloud) Use the attached repro.py file in the following way:
conda activate torch-xla-1.7
pip install pytorch-lightning==1.2.1
export TPU_IP_ADDRESS=my.tpu.ip.addr
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
python3 repro.py
3. (Your CI setup) Modify the TPU unit tests as follows:
Add a trainer.test(test_dataloaders=DataLoader(RandomDataset(32, 2000), batch_size=32)) after some call to trainer.fit
For example, I changed the test_model_tpu_early_stop test to look like this:
@pytest.mark.skipif(not _TPU_AVAILABLE, reason="test requires TPU machine")
@pl_multi_process_test
def test_model_tpu_early_stop(tmpdir):
    """Test if single TPU core training works"""
    # todo: Test on 8 cores - hanging.
    class CustomBoringModel(BoringModel):

        def validation_step(self, *args, **kwargs):
            out = super().validation_step(*args, **kwargs)
            self.log('val_loss', out['x'])
            return out

    tutils.reset_seed()
    model = CustomBoringModel()
    trainer = Trainer(
        callbacks=[EarlyStopping(monitor='val_loss')],
        default_root_dir=tmpdir,
        progress_bar_refresh_rate=0,
        max_epochs=2,
        limit_train_batches=2,
        limit_val_batches=2,
        tpu_cores=[1],
    )
    trainer.fit(model)
+   trainer.test(test_dataloaders=DataLoader(RandomDataset(32, 2000), batch_size=32))
I ran the tests with coverage run --source=pytorch_lightning -m pytest tests/models/test_tpu.py -v, which should make this reproducible on the CI framework.
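For reference, repro method 1 above amounts to a trainer cell roughly like the following. This is only a rough sketch of my Colab cell: BoringModel and RandomDataset are assumed to be defined by the standard BoringModel notebook, and the dataloader sizes are illustrative.

import os
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# BoringModel and RandomDataset come from the BoringModel notebook (not shown here)
model = BoringModel()
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=2,
    limit_val_batches=2,
    max_epochs=1,
    tpu_cores=8,  # one of the modifications: run on all 8 TPU cores
)
trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=32))
trainer.test(test_dataloaders=DataLoader(RandomDataset(32, 2000), batch_size=32))  # crashes as described above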
Environment
PyTorch Version (e.g., 1.0): 1.7
OS (e.g., Linux): Linux
Build command you used (if compiling from source): pip install pytorch-lightning==1.2.1 (note that earlier versions hang due to #5841, "Hanging with TPUs on GCE VM")