[DCP][HF][ez] Change where sharded tensors are saved #158069
ankitageorge wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158069
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure as of commit 66c81b9 with merge base 7d4228d.
NEW FAILURES - The following jobs have failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D78108144
Force-pushed d23ae54 to f920f12, then f920f12 to b6878f3.
Force-pushed b6878f3 to 9abbc47.
Force-pushed 9abbc47 to b7e21a4.
Force-pushed b7e21a4 to 26eccbf.
Force-pushed 26eccbf to 66c81b9.
@pytorchmergebot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3.9-gcc11-no-ops / build, pull / linux-jammy-cuda12.8-py3.10-gcc11-sm89 / build, pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable), Lint / lintrunner-clang / linux-job. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -c nosignal -m "Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py"
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit 627ba41. Reverted #158069 on behalf of https://github.com/jithunnair-amd due to: Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py; CUDA runs do not surface the issue because safetensors is not installed and the test silently passes (see comment on #158069).
@ankitageorge your PR has been successfully reverted. |
#158069 removed the consolidated output path argument without updating the test. Reported by a user in #156705 (comment). This adds back the logic from the original PR #158069 and fixes the test. Pull Request resolved: #158685 Approved by: https://github.com/teja-rao
Summary: Previously, sharded tensors were saved to the same directory as full tensors. This doesn't make sense: on load(), you would be reading from a directory that contains both, with no way to distinguish them, so they should be kept in separate folders.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D78108144
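
The directory split described in the summary can be sketched as follows. This is a minimal illustration, not the actual torch.distributed.checkpoint implementation: the subfolder names "sharded" and "consolidated", the file-name patterns, and both helper functions are assumptions chosen for the example.

```python
import os

# Hypothetical checkpoint layout illustrating the fix: per-rank sharded
# files and the consolidated full-tensor file go into separate subfolders,
# so a loader can tell them apart instead of scanning one mixed directory.
# Folder and file names here are assumptions, not DCP's real layout.

def sharded_save_path(checkpoint_dir: str, rank: int) -> str:
    """Return the path for one rank's sharded tensor file."""
    subdir = os.path.join(checkpoint_dir, "sharded")
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, f"shard-{rank:05d}.safetensors")

def consolidated_save_path(checkpoint_dir: str) -> str:
    """Return the path for the consolidated (full-tensor) file."""
    subdir = os.path.join(checkpoint_dir, "consolidated")
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, "model.safetensors")
```

With this split, load() can enumerate only the subfolder matching the kind of checkpoint it expects, which is the distinguishability the summary calls for.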
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta