[WIP][3/N][dtensor] allow distribute_tensor to shard DTensor #130551
XilunWu wants to merge 4 commits into gh/XilunWu/88/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130551
Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 2711050 with merge base dc7725c:

NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
```python
right_to_left_distribute = (
    device_mesh.ndim == 1  # the device mesh arg must be a 1-d mesh
    and len(placements) == 1  # must be True for 1-d mesh
    and input_parent_mesh is not None
    and input_parent_mesh == dtensor_parent_mesh  # distribute over the same mesh
)
```
Can we avoid adding another level of indentation by moving the `return tensor` and the exception raise from below up to here? Then we don't need the extra indentation.
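The refactor being suggested is the guard-clause pattern: invert the condition, raise (or return) early for the unsupported case, and keep the main path at the top indentation level. A minimal sketch of the pattern, assuming a simplified function signature (the name `distribute` and the error message are illustrative, not the PR's actual code):

```python
def distribute(tensor, device_mesh, placements, right_to_left_distribute):
    # Guard clause: reject the unsupported case up front, so the
    # supported path below needs no extra nesting.
    if not right_to_left_distribute:
        raise NotImplementedError(
            "distribute_tensor on a DTensor only supports right-to-left "
            "distribution over the same parent mesh"
        )
    # ... supported path, written without an extra indentation level ...
    return tensor
```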
| "Calling distribute_tensor() on DTensor objects require the " | ||
| f"placement be a Shard() placement. Input args: tensor={tensor}, " | ||
| f"device_mesh={device_mesh}, placements={placements}." | ||
| ) |
`Replicate()` is also required, as we need to support HSDP?
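For context, HSDP replicates parameters along one mesh dimension and shards them along another, so a placement check that only accepts `Shard()` would reject it. A toy model of the relaxed check (the classes below are stand-ins for the real `torch.distributed` placement types, and `validate_placements` is a hypothetical helper, not the PR's code):

```python
class Shard:
    """Stand-in for the Shard(dim) placement type."""
    def __init__(self, dim):
        self.dim = dim

class Replicate:
    """Stand-in for the Replicate() placement type."""

def validate_placements(placements):
    # HSDP uses Replicate() on the replicate mesh dim and Shard() on the
    # shard mesh dim, so both must be accepted; anything else is rejected.
    for p in placements:
        if not isinstance(p, (Shard, Replicate)):
            raise ValueError(f"unsupported placement: {p!r}")
    return True
```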
```python
# mesh, and potentially so on.
input_parent_mesh = _mesh_resources.get_parent_mesh(device_mesh)
dtensor_parent_mesh = _mesh_resources.get_parent_mesh(tensor.device_mesh)
right_to_left_distribute = (
```
We don't actually check whether we are sharding from right to left. If the input mesh is actually to the right of the tensor's submesh, we should raise an exception.
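One way to make the direction check explicit is to compare the positions of the two submeshes' dimensions within the shared parent mesh. A toy sketch, modeling each mesh as a tuple of dim names rather than a real `DeviceMesh` (the helper name and this representation are assumptions for illustration):

```python
def check_right_to_left(parent_dims, input_mesh_dims, tensor_mesh_dims):
    # "Right-to-left" means the mesh we are distributing over sits
    # strictly to the LEFT (outer dims) of the mesh the tensor is
    # already distributed over, within the shared parent mesh.
    input_pos = max(parent_dims.index(d) for d in input_mesh_dims)
    tensor_pos = min(parent_dims.index(d) for d in tensor_mesh_dims)
    if input_pos >= tensor_pos:
        raise ValueError(
            "input mesh is to the right of the tensor's submesh; "
            "only right-to-left distribution is supported"
        )
    return True

# e.g. parent mesh ("dp", "tp"): distributing a tp-sharded tensor
# over the "dp" dim is right-to-left and passes the check.
check_right_to_left(("dp", "tp"), ("dp",), ("tp",))
```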
wanchaol left a comment:
Overall I think this is valuable to add to distribute_tensor, but only after the strided shard placement has matured (i.e. all DTensor ops work with this placement, or we merge this placement back into the Shard placement).
Before that happens, I think we should focus on making FSDP2 + TP use it first (i.e. add the integration to the FSDP init shard param path).
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as …
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o