gpt-oss + compile breaking CI #2776

@tianyu-l

Bug description

./run_train.sh --module gpt_oss --config gpt_oss_debugmodel --parallelism.data_parallel_shard_degree 4 --parallelism.tensor_parallel_degree 2 --parallelism.expert_parallel_degree 4 --parallelism.expert_tensor_parallel_degree 1 --compile.enable
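For anyone trying to reproduce this, a rough sketch of how many ranks the command above implies. It assumes that the degrees not given on the command line (data-parallel replicate, context parallel, pipeline parallel) are 1, and that expert parallelism reuses the existing data-parallel and tensor-parallel ranks instead of adding new ones. Both are assumptions about torchtitan's behavior, not something stated in this issue. `parse_flags` is a hypothetical helper written for this illustration.

```python
import shlex

# The repro command from this issue, split across lines for readability.
command = (
    "./run_train.sh --module gpt_oss --config gpt_oss_debugmodel "
    "--parallelism.data_parallel_shard_degree 4 "
    "--parallelism.tensor_parallel_degree 2 "
    "--parallelism.expert_parallel_degree 4 "
    "--parallelism.expert_tensor_parallel_degree 1 "
    "--compile.enable"
)


def parse_flags(cmd):
    """Hypothetical helper: turn '--name value' / '--switch' tokens into a dict."""
    tokens = shlex.split(cmd)[1:]  # drop the script name
    flags = {}
    i = 0
    while i < len(tokens):
        name = tokens[i].lstrip("-")
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        flags[name] = tokens[i + 1] if has_value else True
        i += 2 if has_value else 1
    return flags


flags = parse_flags(command)
dp_shard = int(flags["parallelism.data_parallel_shard_degree"])
tp = int(flags["parallelism.tensor_parallel_degree"])

# Assumption: all other mesh degrees are 1, and EP/ETP do not add ranks.
world_size = dp_shard * tp
print(f"ranks implied by the command: {world_size}")
```

Under those assumptions the run needs 8 ranks, so a machine with fewer GPUs would likely fail before reaching the compile error.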

Not sure whether this is related to #2771.

Error message:

  traceback : Traceback (most recent call last):
    File "/home/lty/local/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 367, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 835, in train
      self.train_step(data_iterator)
    File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 745, in train_step
      loss = self.forward_backward_step(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 703, in forward_backward_step
      loss.backward()
    File "/home/lty/local/pytorch/torch/_tensor.py", line 631, in backward
      torch.autograd.backward(
    File "/home/lty/local/pytorch/torch/autograd/__init__.py", line 379, in backward
      _engine_run_backward(
    File "/home/lty/local/pytorch/torch/autograd/graph.py", line 877, in _engine_run_backward
      return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/pytorch/torch/autograd/function.py", line 317, in apply
      return user_fn(self, *args)
             ^^^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2779, in backward
      all_args = _backward_prologue_functional(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2054, in _backward_prologue_functional
      flat_processed_tangents = list(
                                ^^^^^
    File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2057, in <genexpr>
      AOTDispatchAutograd.process_runtime_tangent(
    File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2486, in process_runtime_tangent
      new_elem, elem_leaves = AOTDispatchAutograd.process_runtime_tangent(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2449, in process_runtime_tangent
      raise AOTDispatchAutograd._raise_tangent_metadata_error(
  RuntimeError: 
  During the backward, we encountered a tensor subclass where we guessed its
  metadata incorrectly.
  Expected a AsyncCollectiveTensor tangent but got a plain Tensor.
  This happens when a compiled function returns multiple outputs that
  require gradients, but .backward() is only called on some of them.
  To fix: call .detach() on forward outputs you don't need gradients for.
  
  This error is also more likely to occur if your compiled model is suffering
  from a large number of graph breaks. For more advice on finding and fixing
  graph breaks, see:
  https://docs.pytorch.org/docs/stable/user_guide/torch_compiler/compile/programming_model.graph_breaks_index.html
  
  For more info about this error, see:
  https://github.com/pytorch/pytorch/issues/172556
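The fix suggested in the error text can be sketched on a toy function. This is not a reproduction of the failure above: it uses plain tensors on a single process, while the issue involves an `AsyncCollectiveTensor` tangent in a distributed run, and with plain tensors the undetached version would normally not fail. `two_outputs_detached` is a hypothetical name, and `backend="aot_eager"` is chosen only so the sketch goes through AOT autograd without needing a codegen toolchain.

```python
import torch


def two_outputs_detached(x):
    # Output that the caller will call .backward() on.
    used = (x * 2).sum()
    # Output that is returned but never backpropagated through.
    # Detaching it means the compiled graph does not expect a tangent for it.
    unused = x.sin().detach()
    return used, unused


# "aot_eager" traces through AOT autograd but runs the graph eagerly.
compiled = torch.compile(two_outputs_detached, backend="aot_eager")

x = torch.randn(4, requires_grad=True)
loss, aux = compiled(x)

# Only `loss` takes part in the backward pass; `aux` carries no grad history.
loss.backward()
print(x.grad, aux.requires_grad)
```

Whether the right fix for this issue is a `.detach()` in the model code or a change in how the collective output is wrapped is not something this sketch can settle.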

Versions

Latest PyTorch and torchtitan.

Status: Done
