Avoid reuploading preuploaded LFS files in upload-large-folder#4165
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes upload_large_folder resume behavior by restoring the “already preuploaded” LFS state when rebuilding commit operations from local metadata, preventing redundant LFS uploads during the commit phase.
Changes:
- Restore
CommitOperationAdd._is_uploadedfromLocalUploadFileMetadata.is_uploadedwhen building operations. - For already-uploaded LFS files, clear
path_or_fileobjto match the post-upload “free memory” behavior. - Add a regression test ensuring
_build_hacky_operationpreserves LFS preupload state.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/huggingface_hub/_upload_large_folder.py |
Propagates is_uploaded into rebuilt commit operations and clears content for preuploaded LFS items to avoid reupload. |
tests/test_upload_large_folder.py |
Adds coverage for preserving LFS preupload state when rebuilding operations from local metadata. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4165 +/- ##
==========================================
+ Coverage 75.00% 76.97% +1.96%
==========================================
Files 145 168 +23
Lines 13978 19060 +5082
==========================================
+ Hits 10484 14671 +4187
- Misses 3494 4389 +895 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Wauplin
left a comment
There was a problem hiding this comment.
Thanks for finding this out and the PR @Dev-Jahn ! I don't get why we never caught this issue but after carefully reviewing this PR + #2254 and #3100, I can confirm this looks like a complete oversight from us. And thanks for the minimal test as well :)
Note that if the metadata is older than 20h, we automatically set is_uploaded to False even if metadata says otherwise because the uploaded Xet file might have been garbage-collected in the meantime (if uploaded but not committed).
Anyway, thanks for the fix!
|
This PR has been shipped as part of the v1.13.0 release. |
What does this PR do?
upload_large_folder stores whether a file has already been preuploaded in LocalUploadFileMetadata.is_uploaded, but _build_hacky_operation did not restore that state on the CommitOperationAdd created for the commit step.
As a result, create_commit could call preupload_lfs_files again for files that were already uploaded by the large-folder preupload step. This makes the commit phase unnecessarily re-read/re-upload LFS files after resuming or after preupload completion.
This PR restores _is_uploaded from the local metadata and clears path_or_fileobj for already uploaded LFS files, matching the behavior of preupload_lfs_files after a successful upload.
Tests