Skip to content

Revamp download to local dir process#2223

Merged
Wauplin merged 43 commits into
mainfrom
1738-revampt-download-local-dir
Apr 29, 2024
Merged

Revamp download to local dir process#2223
Wauplin merged 43 commits into
mainfrom
1738-revampt-download-local-dir

Conversation

@Wauplin

@Wauplin Wauplin commented Apr 12, 2024

Copy link
Copy Markdown
Collaborator

Implements #1738 (and especially #1738 (comment)) 🙈

What does this PR do?

  • Downloading files with local_dir do not use the cache but rely on a .huggingface/ folder instead
  • Prevent .huggingface/ from being committed
  • Adapt CLI
  • Revampt all "local dir" tests
  • All downloads (to local dir or to cache) will not resume by default any incomplete downloads. This should prove useful on slow connections. IMO there is very little benefit to make this optional. Only counter-example: if hf_transfer is enabled, we do not resume download (not supported). One can use force_download to force a download from scratch.

How it works?

When downloading a file file.txt to local dir data/:

  1. Checks if:
    1. data/file.txt exists
    2. and data/.huggingface/download/file.txt.metadata exists
    3. and data/file.txt has been modified before the metadata file was saved (metadata contains a timestamp)
  2. If (1.) is true, then we consider the metadata as valid. Metadata contains commit_hash and etag. Otherwise we consider that we don't have any info on the local file.
  3. if revision == metadata.commit_hash, then the file is valid => return
  4. otherwise, we fetch remote metadata
    1. If remote etag == metadata.etag => update local metadata => return
    2. if remote tag is a sha256 and we don't have local metadata => we hash local file. If sha256 == remote etag => it's a valid LFS file => return
    3. Otherwise (etag mismatch and not LFS or LFS but sha256 mismatch) => download file

If force_download=True is passed, all of the above is skipped => we download the file no matter what.

What to review?

This is a large PR (+1,265 −760) as it touches in depth the download logic. However a lot of the changes are about moving parts of code into private helpers to avoid duplicating the logic between _hf_hub_download_local_dir and _hf_hub_download_cache_dir.

Important changes are:

  • logic inside _local_folder.py => handles metadata in the local folder (i.e. inside ./huggingface/)
  • file_download.py => where everything happens. Best to read the file instead of raw changes. Most important part is _hf_hub_download_local_dir while hf_hub_download and _hf_hub_download_cache_dir are iso-feature compared to before.
  • test_file_download.py: all the new test cases in HfHubDownloadToLocalDir

Doc changes:

  • cli.md, download.md, environment_variables.md
  • snapshot_download.py => only some docs + few tweaks, no real update
  • hf_api.py => only some docs + few tweaks, no real update

Less important:

  • logic to prevent from committing huggingface/ folder (_commit_api.py + test_commit_api.py + test_utils_paths.py)
  • test_cli.py => not relevant
  • command/download.py => deprecated --local-dir-use-symlinks in CLI
  • constants.py/hub_mixin.py/keras_mixin.py => some deprecation

Example

Download README.md and model.safetensors from gpt2 repo into ./data/gpt2 folder:

huggingface-cli download gpt2 README.md model.safetensors --local-dir=data/gpt2

Resulting tree:

# tree -alh data/
[4.0K]  data/
└── [4.0K]  gpt2
    ├── [4.0K]  .huggingface
    │   ├── [4.0K]  download
    │   │   ├── [   0]  model.safetensors.lock
    │   │   ├── [ 182]  model.safetensors.metadata
    │   │   ├── [   0]  README.md.lock
    │   │   └── [ 158]  README.md.metadata
    │   └── [   1]  .gitignore
    ├── [523M]  model.safetensors
    └── [7.9K]  README.md

3 directories, 7 files

How to try it?

pip install git+https://github.com/huggingface/huggingface_hub@1738-revampt-download-local-dir

TODO

  • add an explanation guide on "how .huggingface/ works?" (similar to the "Cache" guide?)
  • extensively test corner cases

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin changed the title [wip ] Revampt download to local dir process Revampt download to local dir process Apr 24, 2024
@Wauplin Wauplin marked this pull request as ready for review April 24, 2024 16:25

@LysandreJik LysandreJik left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Super nice side effect of having resume_downloads on by default.
I played with it locally and we already discussed potential improvements.

The rest LGTM!

Comment thread src/huggingface_hub/_local_folder.py Outdated
Comment thread src/huggingface_hub/commands/download.py Outdated
Comment thread src/huggingface_hub/commands/download.py Outdated
Comment thread src/huggingface_hub/file_download.py Outdated
Comment thread src/huggingface_hub/_snapshot_download.py Outdated
Comment thread docs/source/en/package_reference/environment_variables.md Outdated
Wauplin and others added 2 commits April 25, 2024 14:29
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Lysandre Debut <hi@lysand.re>

@pcuenca pcuenca left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! (from a distance :))

Comment thread docs/source/en/guides/download.md Outdated
Comment thread src/huggingface_hub/commands/download.py
Comment thread src/huggingface_hub/file_download.py
Comment thread src/huggingface_hub/commands/download.py Outdated
Wauplin and others added 3 commits April 29, 2024 08:04
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
@Wauplin

Wauplin commented Apr 29, 2024

Copy link
Copy Markdown
Collaborator Author

Thanks @stevhliu @pcuenca for the review! I like it better like this :D

@Wauplin

Wauplin commented Apr 29, 2024

Copy link
Copy Markdown
Collaborator Author

Addressed all comments and fixed the CI (failures were not only due to this PR but I fixed them here anyway).

Let's get this merged! 🎉

@Wauplin Wauplin merged commit 4df59b4 into main Apr 29, 2024
@Wauplin Wauplin deleted the 1738-revampt-download-local-dir branch April 29, 2024 08:50
@Wauplin Wauplin changed the title Revampt download to local dir process Revamp download to local dir process May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants