./run_train.sh --module gpt_oss --config gpt_oss_debugmodel --parallelism.data_parallel_shard_degree 4 --parallelism.tensor_parallel_degree 2 --parallelism.expert_parallel_degree 4 --parallelism.expert_tensor_parallel_degree 1 --compile.enable
traceback : Traceback (most recent call last):
File "/home/lty/local/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 367, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 835, in train
self.train_step(data_iterator)
File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 745, in train_step
loss = self.forward_backward_step(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lty/local/torchtitan/torchtitan/trainer.py", line 703, in forward_backward_step
loss.backward()
File "/home/lty/local/pytorch/torch/_tensor.py", line 631, in backward
torch.autograd.backward(
File "/home/lty/local/pytorch/torch/autograd/__init__.py", line 379, in backward
_engine_run_backward(
File "/home/lty/local/pytorch/torch/autograd/graph.py", line 877, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lty/local/pytorch/torch/autograd/function.py", line 317, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2779, in backward
all_args = _backward_prologue_functional(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2054, in _backward_prologue_functional
flat_processed_tangents = list(
^^^^^
File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2057, in <genexpr>
AOTDispatchAutograd.process_runtime_tangent(
File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2486, in process_runtime_tangent
new_elem, elem_leaves = AOTDispatchAutograd.process_runtime_tangent(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lty/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2449, in process_runtime_tangent
raise AOTDispatchAutograd._raise_tangent_metadata_error(
RuntimeError:
During the backward, we encountered a tensor subclass where we guessed its
metadata incorrectly.
Expected a AsyncCollectiveTensor tangent but got a plain Tensor.
This happens when a compiled function returns multiple outputs that
require gradients, but .backward() is only called on some of them.
To fix: call .detach() on forward outputs you don't need gradients for.
This error is also more likely to occur if your compiled model is suffering
from a large number of graph breaks. For more advice on finding and fixing
graph breaks, see:
https://docs.pytorch.org/docs/stable/user_guide/torch_compiler/compile/programming_model.graph_breaks_index.html
For more info about this error, see:
https://github.com/pytorch/pytorch/issues/172556
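For reference, the generic advice in the error message (detach forward outputs that do not need gradients) can be sketched as follows. This is only an illustration of that advice, not a confirmed fix for this issue, where the mismatch involves an AsyncCollectiveTensor tangent under tensor/expert parallelism. The function name forward_with_aux and the tensor shapes are hypothetical, and the sketch runs in eager mode for simplicity; the same pattern is what one would apply inside a function passed to torch.compile.

```python
import torch


def forward_with_aux(x, w):
    """Hypothetical forward that returns a loss plus an auxiliary output."""
    hidden = x @ w
    loss = hidden.pow(2).mean()
    # The auxiliary output is only used for logging/inspection, so it is
    # detached before being returned. The function then has a single output
    # that requires grad, and backward() is called on exactly that output.
    aux = hidden.detach()
    return loss, aux


torch.manual_seed(0)
x = torch.randn(4, 8)
w = torch.randn(8, 8, requires_grad=True)

loss, aux = forward_with_aux(x, w)
loss.backward()  # gradients flow only through `loss`

print("loss requires grad:", loss.requires_grad)
print("aux requires grad:", aux.requires_grad)
print("w.grad populated:", w.grad is not None)
```

Whether any such partially-used output exists in the gpt_oss forward path here is an open question; the traceback alone does not show it.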
Bug description
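Not sure if this is related to #2771.
Running the command above with compile enabled fails during the backward pass; the full error message is the traceback shown above.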
Versions
Latest PyTorch and torchtitan.