Skip to content

[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor#130551

Closed
XilunWu wants to merge 4 commits intogh/XilunWu/88/basefrom
gh/XilunWu/88/head
Closed

[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor#130551
XilunWu wants to merge 4 commits intogh/XilunWu/88/basefrom
gh/XilunWu/88/head

Conversation

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130551

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 2711050 with merge base dc7725c (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 11, 2024
@XilunWu XilunWu marked this pull request as draft July 11, 2024 17:08
@XilunWu XilunWu changed the title [3/N]/dtensor] allow distribute_tensor to shard DTensor [WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor Jul 11, 2024
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Jul 16, 2024
Comment on lines +661 to +666
right_to_left_distribute = (
device_mesh.ndim == 1 # the device mesh arg must be a 1-d mesh
and len(placements) == 1 # must be True for 1-d mesh
and input_parent_mesh is not None
and input_parent_mesh == dtensor_parent_mesh # distribute over the same mesh
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid using another indention and instead move the below return tensor and raise exception here? Then we don't need another indention.

"Calling distribute_tensor() on DTensor objects require the "
f"placement be a Shard() placement. Input args: tensor={tensor}, "
f"device_mesh={device_mesh}, placements={placements}."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replicate() is also required as we need to support HSDP?

# mesh, and potentially so on.
input_parent_mesh = _mesh_resources.get_parent_mesh(device_mesh)
dtensor_parent_mesh = _mesh_resources.get_parent_mesh(tensor.device_mesh)
right_to_left_distribute = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually check if we are sharding from right to left. If the input mesh is actually in the right of the tensor's submesh, we should raise an exception.

Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is valuable to add to distribute_tensor, but only after the strided shard placement get matured (i.e. all DTensor ops are working with this placement or we merge this placement back to Shard placemnet).

Before that happens, I think we should focus on making FSDP2 + TP to use it first (i.e. add the integration to FSDP init shard param path)

@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Sep 14, 2024
@github-actions github-actions bot closed this Oct 14, 2024
@github-actions github-actions bot deleted the gh/XilunWu/88/head branch November 14, 2024 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue Stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants