
preserve node stacktraces from compiled autograd through AOTDispatcher, due to GmWrapper #133574

Closed

bdhirsh wants to merge 1 commit into gh/bdhirsh/605/base from gh/bdhirsh/605/head

Conversation

@bdhirsh (Collaborator) commented Aug 15, 2024

Fixes #133567

New log output from the repro:

(/home/hirsheybar/local/b/pytorch-env) [hirsheybar@devgpu001.lla3 ~/local/b/pytorch (compiled_autograd_stacktraces)]$ TORCH_LOGS="compiled_autograd_verbose,aot" python tmp5.py
INFO: TRACED GRAPH
 ===== Joint graph 0 =====
 /home/hirsheybar/local/b/pytorch/torch/fx/_lazy_graph_module.py class joint_helper(torch.nn.Module):
    def forward(self, primals, tangents):
        primals_1: "f32[4, 4][4, 1]cpu"; tangents_1: "f32[4, 4][4, 1]cpu";

        primals_1, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
         # File: /home/hirsheybar/local/b/pytorch/tmp5.py:6 in f, code: return torch.matmul(x, x)
        mm: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(primals_1, primals_1)
        permute: "f32[4, 4][1, 4]cpu" = torch.ops.aten.permute.default(primals_1, [1, 0])
        mm_1: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(permute, tangents_1);  permute = None
        permute_1: "f32[4, 4][1, 4]cpu" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        mm_2: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(tangents_1, permute_1);  tangents_1 = permute_1 = None

         # File: /home/hirsheybar/local/b/pytorch/tmp5.py:6 in f, code: return torch.matmul(x, x)
        add: "f32[4, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(mm_2, mm_1);  mm_2 = mm_1 = None
        return pytree.tree_unflatten([mm, add], self._out_spec)


INFO: aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=False)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=True, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=False, traced_tangents=[FakeTensor(..., size=(4, 4))], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[0], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, deterministic=False, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None), inner_meta=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=False)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=True, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=False, traced_tangents=[FakeTensor(..., size=(4, 4))], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[0], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, deterministic=False, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None)
INFO: TRACED GRAPH
 ===== Forward graph 0 =====
 /home/hirsheybar/local/b/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cpu"):
         # File: /home/hirsheybar/local/b/pytorch/tmp5.py:6 in f, code: return torch.matmul(x, x)
        mm: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(primals_1, primals_1)
        permute: "f32[4, 4][1, 4]cpu" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        return (mm, permute)


INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, permute: "f32[4, 4][1, 4]cpu", tangents_1: "f32[4, 4][4, 1]cpu"):
         # File: /home/hirsheybar/local/b/pytorch/tmp5.py:6 in f, code: return torch.matmul(x, x)
        mm_1: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(permute, tangents_1)
        mm_2: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(tangents_1, permute);  tangents_1 = permute = None

         # File: /home/hirsheybar/local/b/pytorch/tmp5.py:6 in f, code: return torch.matmul(x, x)
        add: "f32[4, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(mm_2, mm_1);  mm_2 = mm_1 = None
        return (add,)


DEBUG: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
DEBUG: TRACED GRAPH
 ===== Compiled autograd graph =====
 <eval_with_key>.2 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        getitem_1: "f32[4, 4]cpu" = inputs[1]
        getitem_2: "f32[4, 4]cpu" = inputs[2];  inputs = None

         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/compiled_autograd.py:379 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4, 4]cpu" = torch.ops.aten.expand.default(getitem, [4, 4]);  getitem = None

         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/compiled_autograd.py:379 in set_node_origin, code: CompiledFunctionBackward (NodeCall 2)
        clone: "f32[4, 4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        mm: "f32[4, 4]cpu" = torch.ops.aten.mm.default(getitem_1, clone)
        mm_1: "f32[4, 4]cpu" = torch.ops.aten.mm.default(clone, getitem_1);  clone = getitem_1 = None
        add: "f32[4, 4]cpu" = torch.ops.aten.add.Tensor(mm_1, mm);  mm_1 = mm = None

         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/compiled_autograd.py:379 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 3)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_2, add);  getitem_2 = add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []


INFO: TRACED GRAPH
 ===== Forward graph 1 =====
 /home/hirsheybar/local/b/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[][]cpu", arg1_1: "f32[4, 4][1, 4]cpu", arg2_1: "f32[4, 4][4, 1]cpu"):
         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/compiled_autograd.py:379 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4, 4][0, 0]cpu" = torch.ops.aten.expand.default(arg0_1, [4, 4]);  arg0_1 = None

         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/compiled_autograd.py:379 in set_node_origin, code: CompiledFunctionBackward (NodeCall 2)
        clone: "f32[4, 4][4, 1]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        mm: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(arg1_1, clone)
        mm_1: "f32[4, 4][4, 1]cpu" = torch.ops.aten.mm.default(clone, arg1_1);  clone = arg1_1 = None
        add: "f32[4, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(mm_1, mm);  mm_1 = mm = None

         # File: /home/hirsheybar/local/b/pytorch/torch/_dynamo/polyfill.py:44 in accumulate_grad, code: new_grad = torch.clone(new_grad)
        clone_1: "f32[4, 4][4, 1]cpu" = torch.ops.aten.clone.default(add);  add = None
        return (clone_1,)

The problem was that AOTAutograd expects its input to be a GraphModule in order to do all of the fancy stacktrace-preservation logic, but we now need to handle compiled autograd passing in a GmWrapper instead (which it uses to try to preserve input boxing, so that inductor can properly free activations).
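The fix amounts to looking through the wrapper before deciding whether stacktrace preservation applies. A minimal pure-Python sketch of the idea, where `GraphModuleLike`, `GmWrapperLike`, and `get_inner_graph_module` are illustrative stand-ins, not the real `torch.fx.GraphModule` / `torch._dynamo.utils.GmWrapper` types:

```python
# Stand-in classes sketching the GraphModule / GmWrapper relationship.
# All names here are hypothetical, for illustration only.
class GraphModuleLike:
    def __init__(self, node_stack_traces):
        # maps node name -> original user stack trace
        self.node_stack_traces = node_stack_traces

class GmWrapperLike:
    """Wraps a graph module to change its calling convention (boxed inputs)."""
    def __init__(self, gm):
        self.gm = gm

def get_inner_graph_module(mod):
    # Look through the wrapper so stacktrace-preservation logic
    # sees the underlying graph module either way.
    if isinstance(mod, GmWrapperLike):
        return mod.gm
    if isinstance(mod, GraphModuleLike):
        return mod
    return None  # non-graph frontends: nothing to preserve

inner = get_inner_graph_module(
    GmWrapperLike(GraphModuleLike({"mm": "tmp5.py:6 in f"}))
)
assert inner.node_stack_traces["mm"] == "tmp5.py:6 in f"
```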

Stack from ghstack (oldest at bottom):

@pytorch-bot (bot) commented Aug 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133574

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 1 Unrelated Failure

As of commit 50d5811 with merge base 454713f:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bdhirsh (Collaborator, Author) commented Aug 15, 2024

CI is unhappy - from a quick look, I'm failing to ensure that the other args to the compiled backward (like hooks) are properly accounted for in the graph

    # to ensure args are boxed.
    assert params_len == 0
    assert len(kwargs) == 0
    out = PropagateUnbackedSymInts(mod_).run(args)
Member:

there's some logic in GmWrapper.forward that we'll need here:

pytorch/torch/_dynamo/utils.py, lines 2907 to 2909 at 90d2593:

    def forward(self, *args):
        args: List[Any] = list(args)
        return self.gm(*self.unflatten_fn(args))

Contributor:

You should probably just have a "middleware" wrapper that uniformly takes care of unwrapping GmWrapper and modifying the calling convention, should be cleaner.

@xmfan is there a reason we HAVE to have a GmWrapper? Shouldn't custom GraphModule prelude/postlude be enough here?

Member:

What is the prelude/postlude? We use GmWrapper to work around the dynamo GraphModule needing boxed inputs, while AOTDispatcher always traces the GraphModule with flat inputs.
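The boxed-vs-flat mismatch can be illustrated without torch: a boxed callee receives one mutable list it can drain (so the caller's references are dropped and activations can be freed eagerly), while a flat tracer wants positional args. A tiny adapter bridges the two; `boxed_backward` and `flat_to_boxed` are illustrative names, not the real APIs:

```python
def boxed_backward(inputs):
    # Boxed convention: one list argument; the callee pops entries so
    # the caller no longer holds references and memory can be freed early.
    a = inputs.pop()
    b = inputs.pop()
    return a + b

def flat_to_boxed(fn):
    # Adapter: expose a flat (positional) signature over a boxed callee,
    # analogous to how GmWrapper re-boxes the flat args that
    # AOTDispatcher traces with.
    def wrapper(*args):
        return fn(list(args))
    return wrapper

flat_fn = flat_to_boxed(boxed_backward)
print(flat_fn(1, 2))  # 3
```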

Contributor:

Code generation for fx.Graph can be overridden via the _codegen field. For example, this is used to generate a GraphModule that can take an arbitrary pytree as an argument and manages the flattening/unflattening in its body. You could potentially use a similar mechanism to implement GmWrapper. cc @suo @Chillee
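The suggestion is that, instead of a separate wrapper object, the generated forward itself could adapt the calling convention. A rough pure-Python analogue of what such a codegen override buys you (the real mechanism is fx.Graph's _codegen; `make_pytree_forward` and the toy unflattener below are stand-ins):

```python
def make_pytree_forward(inner_fn, unflatten_fn):
    # The generated forward accepts the flat argument list and
    # unflattens inside its own body, so no external wrapper is needed.
    def forward(*flat_args):
        structured = unflatten_fn(list(flat_args))
        return inner_fn(structured)
    return forward

# Toy "pytree" spec: first element is a scalar, the rest is a sub-list.
unflatten = lambda flat: (flat[0], flat[1:])
inner = lambda tree: tree[0] * sum(tree[1])

fwd = make_pytree_forward(inner, unflatten)
print(fwd(2, 3, 4))  # 2 * (3 + 4) = 14
```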

Member:

I can look into it, but we probably still need GmWrapper for non-dynamo frontends that are passing in non-overridden graphs.

    # https://github.com/pytorch/pytorch/issues/103569

    def functional_call(*args, **kwargs):
        nonlocal mod
Contributor:

Why nonlocal? Are you assigning over mod?

        mod_, pytree.tree_unflatten(args[:params_len], params_spec)
    ), maybe_disable_thunkify():
-       if isinstance(mod, torch.fx.GraphModule):
+       if isinstance(mod, (torch.fx.GraphModule, torch._dynamo.utils.GmWrapper)):
Contributor:

Why not do the test here on mod_?

@albanD albanD removed their request for review August 21, 2024 21:37
@yf225 (Contributor) commented Sep 10, 2024

@bdhirsh I believe this would be super useful for compiled autograd debugging in general!

@github-actions (bot):

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Nov 10, 2024
@github-actions github-actions bot closed this Dec 10, 2024
@github-actions github-actions bot deleted the gh/bdhirsh/605/head branch January 9, 2025 02:20
