Revamp download to local dir process #2223
Merged
LysandreJik (Member) approved these changes on Apr 25, 2024 and left a comment:
Awesome! Super nice side effect of having resume_downloads on by default.
I played with it locally and we already discussed potential improvements.
The rest LGTM!
Co-authored-by: Lysandre Debut <hi@lysand.re>
pcuenca (Member) reviewed on Apr 27, 2024 and left a comment:
Looks great! (from a distance :))
Author (Collaborator) commented:
Addressed all comments and fixed the CI (failures were not only due to this PR but I fixed them here anyway). Let's get this merged! 🎉
Implements #1738 (and especially #1738 (comment)) 🙈
What does this PR do?
- Downloads to `local_dir` do not use the cache but rely on a `.huggingface/` folder instead.
- Prevent `.huggingface/` from being committed.
- If `hf_transfer` is enabled, we do not resume the download (not supported). One can use `force_download` to force a download from scratch.

How it works?
When downloading a file `file.txt` to local dir `data/`:
1. Check whether `data/file.txt` exists.
2. Check whether `data/.huggingface/download/file.txt.metadata` exists.
3. Check whether `data/file.txt` was last modified before the metadata file was saved (metadata contains a timestamp).
4. If all three checks pass, read `commit_hash` and `etag` from the metadata. Otherwise we consider that we don't have any info on the local file.
5. If `revision == metadata.commit_hash`, then the file is valid => return.
6. If `remote etag == metadata.etag` => update local metadata => return.
7. If the remote etag is a sha256 and we don't have local metadata => we hash the local file. If sha256 == remote etag => it's a valid LFS file => return.

If `force_download=True` is passed, all of the above is skipped => we download the file no matter what.

What to review?
This is a large PR (+1,265 −760) as it touches the download logic in depth. However, a lot of the changes are about moving parts of the code into private helpers to avoid duplicating logic between `_hf_hub_download_local_dir` and `_hf_hub_download_cache_dir`.

Important changes are:
- `_local_folder.py` => handles metadata in the local folder (i.e. inside `.huggingface/`)
- `file_download.py` => where everything happens. Best to read the file instead of raw changes. Most important part is `_hf_hub_download_local_dir`, while `hf_hub_download` and `_hf_hub_download_cache_dir` are iso-feature compared to before.
- `test_file_download.py` => all the new test cases in `HfHubDownloadToLocalDir`

Doc changes:
- `cli.md`, `download.md`, `environment_variables.md`
- `snapshot_download.py` => only some docs + few tweaks, no real update
- `hf_api.py` => only some docs + few tweaks, no real update

Less important:
- Excluding the `.huggingface/` folder from commits (`_commit_api.py` + `test_commit_api.py` + `test_utils_paths.py`)
- `test_cli.py` => not relevant
- `command/download.py` => deprecated `--local-dir-use-symlinks` in CLI
- `constants.py` / `hub_mixin.py` / `keras_mixin.py` => some deprecation

Example
Downloading `README.md` and `model.safetensors` from the `gpt2` repo into the `./data/gpt2` folder gives the following tree:

    # tree -alh data/
    [4.0K]  data/
    └── [4.0K]  gpt2
        ├── [4.0K]  .huggingface
        │   ├── [4.0K]  download
        │   │   ├── [   0]  model.safetensors.lock
        │   │   ├── [ 182]  model.safetensors.metadata
        │   │   ├── [   0]  README.md.lock
        │   │   └── [ 158]  README.md.metadata
        │   └── [   1]  .gitignore
        ├── [523M]  model.safetensors
        └── [7.9K]  README.md

    3 directories, 7 files

How to try it?
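One way to try the example above from Python is a thin wrapper around `hf_hub_download` with `local_dir` set. This is only a sketch, not part of the PR: `download_files` is a hypothetical helper name, and it assumes an installed `huggingface_hub` version that includes this change.

```python
from pathlib import Path
from typing import Iterable, List


def download_files(repo_id: str, filenames: Iterable[str], local_dir: str) -> List[Path]:
    """Download each file of `repo_id` into `local_dir` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    paths = []
    for filename in filenames:
        # Passing `local_dir` asks for the file to be placed under that folder.
        path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
        paths.append(Path(path))
    return paths
```

Calling `download_files("gpt2", ["README.md", "model.safetensors"], "data/gpt2")` needs network access and should leave the two files plus a `.huggingface/` folder under `data/gpt2`, as in the tree above.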
TODO
- Add a guide "How `.huggingface/` works?" (similar to the "Cache" guide?)
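For reviewers, the lookup described under "How it works?" can be sketched as a small standalone function. This is an illustration, not the code in `file_download.py`. The names `LocalMetadata`, `read_local_metadata` and `needs_download` are hypothetical. The three-line metadata layout (commit hash, etag, timestamp) is an assumption made for the sketch. The remote etag is passed in as an argument rather than fetched.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LocalMetadata:
    commit_hash: str
    etag: str
    timestamp: float  # when the metadata file was saved


def read_local_metadata(local_dir: Path, filename: str) -> Optional[LocalMetadata]:
    """Return metadata only if it exists and the file is not newer than it."""
    file_path = local_dir / filename
    meta_path = local_dir / ".huggingface" / "download" / f"{filename}.metadata"
    if not file_path.exists() or not meta_path.exists():
        return None
    try:
        # Assumed layout: one value per line (commit hash, etag, timestamp).
        commit_hash, etag, ts = meta_path.read_text().splitlines()[:3]
        metadata = LocalMetadata(commit_hash, etag, float(ts))
    except ValueError:
        return None
    # A file modified after the metadata was saved cannot be trusted.
    if file_path.stat().st_mtime > metadata.timestamp:
        return None
    return metadata


def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _looks_like_sha256(value: str) -> bool:
    return len(value) == 64 and all(c in "0123456789abcdef" for c in value.lower())


def needs_download(local_dir: Path, filename: str, revision: str,
                   remote_etag: str, force_download: bool = False) -> bool:
    """Return True if the file should be (re)downloaded."""
    if force_download:
        return True  # skip every check
    metadata = read_local_metadata(local_dir, filename)
    if metadata is not None:
        if revision == metadata.commit_hash:
            return False  # same commit: local file is valid
        if remote_etag == metadata.etag:
            return False  # same content; a real implementation would also refresh the metadata
    file_path = local_dir / filename
    if metadata is None and file_path.exists() and _looks_like_sha256(remote_etag):
        # No usable metadata, but the etag is a sha256: compare with the file hash.
        if _sha256(file_path) == remote_etag:
            return False
    return True
```

The sketch only answers "download or not"; writing or updating the metadata file and the download itself are left out on purpose.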