Skip to content

Use dest_offsets directly in LoadPlanner#7155

Merged
jonb377 merged 3 commits intomasterfrom
jonbolin/dcp
Jun 11, 2024
Merged

Use dest_offsets directly in LoadPlanner#7155
jonb377 merged 3 commits intomasterfrom
jonbolin/dcp

Conversation

@jonb377
Copy link
Copy Markdown
Collaborator

@jonb377 jonb377 commented May 30, 2024

In create_read_items_for_chunk_list, the ReadItem's dest_offsets are set to index into the local shard. Despite this, we performed an translation from global to local indices. This caused issues restoring checkpoints with padded tensors, i.e. where the global tensor size is not evenly divisible by the shard size.

We should instead rely on dest_offsets directly to narrow the local shards.

@jonb377 jonb377 requested a review from alanwaketan May 30, 2024 17:33
@jonb377 jonb377 changed the title Jonbolin/dcp Use dest_offsets directly May 30, 2024
@jonb377 jonb377 changed the title Use dest_offsets directly Use dest_offsets directly in LoadPlanner May 30, 2024
@dasoto
Copy link
Copy Markdown

dasoto commented Jun 11, 2024

Any reason this has not been merged?

As context this also fixed an issue we have by saving a distributed checkpoint on a 8x16 and then loading it in other topology.

@jonb377
Copy link
Copy Markdown
Collaborator Author

jonb377 commented Jun 11, 2024

@alanwaketan Do you have a sec to review? I'll also add @JackCaoG

@jonb377 jonb377 requested a review from JackCaoG June 11, 2024 18:15
@JackCaoG
Copy link
Copy Markdown
Collaborator

does this change needs to go into 2.4 release(branch cut was yesterday)?

@jonb377 jonb377 merged commit 7be1d3d into master Jun 11, 2024
@jonb377
Copy link
Copy Markdown
Collaborator Author

jonb377 commented Jun 11, 2024

Yes, we should cherry-pick this. I'll open a PR.

@jonb377 jonb377 deleted the jonbolin/dcp branch June 11, 2024 19:55
Copy link
Copy Markdown
Collaborator

@alanwaketan alanwaketan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Sorry for missing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants