SIGSEGV when exiting the dataloader in the middle of training #6246

@yitongh

Description

The process crashes with SIGSEGV when the following script is run as PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py.

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

# Map the torchrun local rank to the PJRT local process rank.
os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

# Synthetic dataset: 1024 samples of shape (1, 12).
train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  # Early exit: the loader never finishes its iteration, so the
  # xm.mark_step() that MpDeviceLoader would issue is skipped.
  if step > max_steps:
    break

The crash happens because the loop exits the dataloader early, so the xm.mark_step() that pl.MpDeviceLoader issues at the end of a full iteration is never executed. As a result, all_reduce_token is never reset to None, and the token stored in the global g_all_reduce_tokens in torch_xla/csrc/cross_replica_reduces.cpp is only released when the program exits, presumably after the runtime objects it references have already been destroyed, which triggers the SIGSEGV.
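A possible workaround, sketched under the assumption that the diagnosis above is correct, is to issue the skipped xm.mark_step() manually before breaking out of the loop, so the pending all-reduce is flushed and the token is cleared while the runtime is still alive:

for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  if step > max_steps:
    # Flush pending operations before leaving the loop early; this is what
    # MpDeviceLoader would have done had the iteration run to completion,
    # and it should reset the all-reduce token.
    xm.mark_step()
    break

This only sidesteps the symptom; the underlying problem remains that g_all_reduce_tokens can outlive the runtime objects it references.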
