A SIGSEGV error occurs when running the following code with `PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py`:
```python
import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
        # Early exit: the final xm.mark_step() inside MpDeviceLoader never runs.
        break
```
This is due to the early exit from the dataloader, which prevents the `xm.mark_step()` inside `pl.MpDeviceLoader` from being executed. As a result, `all_reduce_token` is never reset to `None`, so the token held in the global variable `g_all_reduce_tokens` in `torch_xla/csrc/cross_replica_reduces.cpp` is only released when the program exits.
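Assuming this diagnosis is correct, a minimal sketch of a workaround (not a fix for the underlying issue) is to call `xm.mark_step()` manually after breaking out of the loop, so the pending all-reduce token is consumed before process teardown:

```python
max_steps = 10
for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
        break

# Flush the pending graph explicitly so the token stored in
# g_all_reduce_tokens is consumed before interpreter shutdown.
# This only works around the crash; the early-exit path in
# MpDeviceLoader still skips its own mark_step().
xm.mark_step()
```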