SIGSEGV when exiting the dataloader in the middle of training #6246

@yitongh

Description

The process crashes with SIGSEGV when the following script is run as PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py.

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

# Map the torchrun local rank to the PJRT local process rank.
os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

# Synthetic dataset: 1024 samples of shape (1, 12).
train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  # Early exit: the loader never finishes its iteration, so the
  # xm.mark_step() that MpDeviceLoader would issue is skipped.
  if step > max_steps:
    break

The crash happens because the loop exits the dataloader early, so the xm.mark_step() that pl.MpDeviceLoader issues at the end of a full iteration is never executed. As a result, all_reduce_token is never reset to None, and the token stored in the global g_all_reduce_tokens in torch_xla/csrc/cross_replica_reduces.cpp is only released when the program exits, presumably after the runtime objects it references have already been destroyed, which triggers the SIGSEGV.
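A possible workaround, sketched under the assumption that the diagnosis above is correct, is to issue the skipped xm.mark_step() manually before breaking out of the loop, so the pending all-reduce is flushed and the token is cleared while the runtime is still alive:

for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  if step > max_steps:
    # Flush pending operations before leaving the loop early; this is what
    # MpDeviceLoader would have done had the iteration run to completion,
    # and it should reset the all-reduce token.
    xm.mark_step()
    break

This only sidesteps the symptom; the underlying problem remains that g_all_reduce_tokens can outlive the runtime objects it references.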
