Make DeviceMesh opaque #169867
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169867
✅ No failures as of commit 07d0720 with merge base dbf7019.
By marking DTensorSpec as a value-type opaque object, the exported graph looks like:
```
def forward(self, b_buffer, x):
_assert_tensor_metadata_default = torch.ops.aten._assert_tensor_metadata.default(x, dtype = torch.float64, device = device(type='cpu'), layout = torch.strided); _assert_tensor_metadata_default = None
to = torch.ops.aten.to.dtype_layout(x, dtype = torch.float64, layout = torch.strided, device = device(type='{self.device_type}')); x = None
view_as = torch.ops.aten.view_as.default(to, to); to = None
dtensor___init__0 = self.dtensor___init__0
dtensor_const_func_spec0 = self.dtensor_const_func_spec0
flat_apply = torch.ops.higher_order.flat_apply(dtensor_const_func_spec0, dtensor___init__0, view_as, DTensorSpec(mesh=DeviceMesh('{self.device_type}', [0, 1]), placements=(Shard(dim=0),), tensor_meta=TensorMeta(shape=torch.Size([8, 4]), stride=(4, 1), dtype=torch.float64), shard_order=(ShardOrderEntry(tensor_dim=0, mesh_dims=(0,)),)), False); dtensor_const_func_spec0 = dtensor___init__0 = view_as = None
add = torch.ops.aten.add.Tensor(b_buffer, flat_apply); b_buffer = flat_apply = None
access_subclass_inner_tensor_default_4 = torch.ops.export.access_subclass_inner_tensor.default(add, '_local_tensor'); add = None
view_as_1 = torch.ops.aten.view_as.default(access_subclass_inner_tensor_default_4, access_subclass_inner_tensor_default_4); access_subclass_inner_tensor_default_4 = None
    return (view_as_1,)
```
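The `flat_apply` call in the graph above receives the constant `DTensorSpec` inline, next to the traced tensor input. As a rough mental model (plain Python; `FakeSpec`, `FakeDTensor`, and this `flat_apply` are hypothetical stand-ins, not the real higher-order op, which also handles pytree unflattening), the op invokes a stored constructor on a mix of traced arguments and baked-in constants:

```python
from typing import Any, Callable, NamedTuple

# Hypothetical stand-ins for the real DTensor machinery.
class FakeSpec(NamedTuple):
    shape: tuple
    placement: str

class FakeDTensor(NamedTuple):
    local: list
    spec: FakeSpec

def flat_apply(fn: Callable, *flat_args: Any):
    # The real HOP also unflattens pytree structures; here the
    # args are already positional, so we just call fn.
    return fn(*flat_args)

# Graph-constant spec (analogous to the inlined DTensorSpec repr)
spec = FakeSpec(shape=(8, 4), placement="Shard(dim=0)")
# "Traced" tensor input (analogous to view_as)
local = [1.0, 2.0]

dt = flat_apply(FakeDTensor, local, spec)
print(dt.spec.placement)  # Shard(dim=0)
```

The key point this sketch illustrates: only the tensor argument flows through the graph as a dynamic input; the spec is reconstructed from a constant embedded at trace time.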
cc ezyang EikanWang jgong5 wenzhe-nrv
>     Returns FX-evaluable repr and required globals for Shard placement.
>     Needed for passing this type as an opaque object input to a custom op.
>     """
>     return f"torch.distributed.tensor.placement_types.Shard(dim={self.dim})", {}
OOC, how do you deal with situations where you need to trigger extra imports? Is that what the rhs is?
Yes! The right-hand side is expected to be a mapping from the FQN used in the repr to the type itself; we then add this mapping to the globals of the FX graph or Inductor code. But for opaque objects in the torch namespace, since torch is already in the globals, we don't need to add additional imports.
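The mechanism described above can be sketched in plain Python (the `Shard` class and `materialize` helper here are hypothetical stand-ins for illustration, not real torch APIs): the repr string is evaluated against a globals dict that has been extended with the required mapping.

```python
import types

class Shard:
    """Hypothetical stand-in for the real Shard placement type."""
    def __init__(self, dim: int):
        self.dim = dim

    def fx_repr(self):
        # Returns (FX-evaluable repr, mapping of the FQN root used in
        # the repr -> object), mirroring the method shown in the diff.
        return f"placements.Shard(dim={self.dim})", {
            "placements": types.SimpleNamespace(Shard=Shard)
        }

def materialize(repr_str, required_globals):
    # Generated FX/Inductor code would merge required_globals into its
    # module globals before evaluating the repr; torch-namespace objects
    # need no extra entry because torch is already in those globals.
    return eval(repr_str, dict(required_globals))

src, extra = Shard(dim=0).fx_repr()
obj = materialize(src, extra)
print(type(obj).__name__, obj.dim)  # Shard 0
```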
By marking DTensorSpec as a value-type opaque object, the Dynamo graph now looks like:
```python
def forward(self, L_x_ : torch.Tensor, L_mesh_ : torch.distributed.device_mesh.DeviceMesh):
l_x_ = L_x_
l_mesh_ = L_mesh_
dt = torch.distributed.tensor._api.from_local(l_x_, l_mesh_, [torch.distributed.tensor.placement_types.Shard(dim=0)], run_check = False); l_x_ = None
redistribute = dt.redistribute(device_mesh = l_mesh_, placements = [torch.distributed.tensor.placement_types.Replicate()]); dt = l_mesh_ = None
to_local = redistribute.to_local(); redistribute = None
add = to_local + 2; to_local = None
return (add,)
```
It takes the DeviceMesh as an input (since it is marked as a reference-type opaque object), calls from_local directly (since it is marked as a TorchInGraphFunctionVariable), and traces to_local/redistribute as call_methods on the tensors.
The AOTAutograd graph decomposes the from_local/to_local/redistribute operations and looks like:
```python
def forward(self, arg0_1, arg1_1):
_to_copy = torch.ops.aten._to_copy.default(arg0_1, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); arg0_1 = None
view = torch.ops.aten.view.default(_to_copy, [1]); _to_copy = None
all_gather_into_tensor = torch.ops._c10d_functional.all_gather_into_tensor.default(view, 2, '0'); view = None
wait_tensor = torch.ops._c10d_functional.wait_tensor.default(all_gather_into_tensor); all_gather_into_tensor = None
view_1 = torch.ops.aten.view.default(wait_tensor, [2]); wait_tensor = None
add = torch.ops.aten.add.Tensor(view_1, 2); view_1 = None
return (add,)
```
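The decomposition above turns `redistribute(Shard(dim=0) -> Replicate())` into a functional all-gather plus views. A plain-Python simulation of that data movement (list-based, no process groups; the function name echoes `_c10d_functional.all_gather_into_tensor` but is purely illustrative):

```python
# Simulate 2 ranks, each holding a 1-element shard of a length-2
# tensor sharded on dim 0.
world = {0: [10.0], 1: [20.0]}  # rank -> local shard

def all_gather_into_tensor(local, group_size, ranks_data):
    # Functional-collective analogue: concatenate every rank's shard
    # in rank order to produce the replicated full tensor.
    assert len(ranks_data) == group_size
    gathered = []
    for rank in sorted(ranks_data):
        gathered.extend(ranks_data[rank])
    return gathered

# After the gather, every rank computes the same replicated result,
# matching the trailing add in the AOTAutograd graph.
for rank, local in world.items():
    replicated = all_gather_into_tensor(local, 2, world)
    out = [v + 2 for v in replicated]
    assert out == [12.0, 22.0]
print("replicated add ok")
```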
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo Lucaskabela ezyang
@angelayi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
This reverts commit 4416f11. Reverted #175510 on behalf of https://github.com/huydhn due to Per our discussion with @angelayi, revert this so that #169867 can be reverted, it is breaking a bunch of internal tests ([comment](#175510 (comment)))
Unsure what's the best way to fix things, but this seems to work! Pull Request resolved: #175510. Approved by: https://github.com/azahed98
@pytorchbot revert -m 'Per our discussion with @angelayi, revert this as it is breaking a bunch of internal tests' -c ghfirst

@pytorchbot successfully started a revert job. Check the current status here.

Reverting PR 169867 failed. Reason: command failed (details for the Dev Infra team; raised by workflow job).
Reverts DeviceMesh tracing changes in pytorch#169867. Pull Request resolved: pytorch#176485. Approved by: https://github.com/huydhn
Differential Revision: [D94288113](https://our.internmc.facebook.com/intern/diff/D94288113)
Pull Request resolved: pytorch#169867
Approved by: https://github.com/ezyang