feat: accept pre-computed SHA-256 in upload_files() #678
Add optional `sha256s` keyword parameter to the Python-exposed `upload_files()` function and forward it to `data_client::upload_async()`. When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue` and skips internal SHA-256 recomputation — the mechanism already exists, but the Python binding was hardcoding `None`.
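A minimal sketch of how a caller might produce the hashes to pass as `sha256s`. The helper names `sha256_hex` and `precompute_sha256s` are hypothetical, and the sketch assumes the binding expects lowercase hex digests in the same order as the file paths; the commented call at the end elides every argument except `sha256s`, since the full signature is not shown here.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def sha256_hex(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def precompute_sha256s(file_paths: Iterable[str]) -> List[str]:
    """Return one hex digest per path, in the same order as the input."""
    return [sha256_hex(Path(p)) for p in file_paths]


# Hypothetical call site (assumption: all other arguments omitted here):
# upload_files(file_paths, ..., sha256s=precompute_sha256s(file_paths))
```

Computing the digest once and reusing it is the point of the parameter: the caller that already hashed the file hands the result over instead of having it recomputed.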
Really nice finding! I reviewed the (orthogonal but #679 would also be nice to have for Buckets upload 😃 )
rajatarya
left a comment
It would be good to have some unit tests for this case, but otherwise this looks good.
When `sha256s` is provided, we should validate that its length matches `file_paths` 1:1.
Add early validation in upload_files() to reject mismatched sha256s/file_paths lengths with a clear Python error message. Add unit tests for the validation. Addresses review feedback from #678.
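The length check described above can be sketched in Python as follows. This is an illustration only: the actual validation lives in the binding itself, and the function name `check_sha256s_length`, the `ValueError` type, and the message wording are all assumptions.

```python
from typing import Optional, Sequence


def check_sha256s_length(
    file_paths: Sequence[str],
    sha256s: Optional[Sequence[str]],
) -> None:
    """Raise early if sha256s is given but does not pair 1:1 with file_paths."""
    if sha256s is None:
        # Nothing provided: hashes are computed internally as before.
        return
    if len(sha256s) != len(file_paths):
        raise ValueError(
            f"sha256s length ({len(sha256s)}) must match "
            f"file_paths length ({len(file_paths)})"
        )
```

Failing before any upload work starts keeps a mismatched list from silently pairing a hash with the wrong file.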
Yeah, you're right, checking the array lengths is an easy check. Also added a test.
## Summary

Add `skip_sha256` and `sha256s` parameters to `upload_bytes()` Python binding for per-file SHA-256 policies:

- `skip_sha256: bool = False` - Skip SHA-256 computation entirely (sets `Sha256Policy::Skip`)
- `sha256s: Optional[List[str]] = None` - Provide pre-computed SHA-256 hashes (companion to existing parameter on `upload_files()`)
- These parameters are mutually exclusive

## Changes

**Python binding changes:**

- Add `skip_sha256` + `sha256s` params to `upload_bytes()` / `upload_files()`
- All policy conversion happens at Python boundary

**Internal refactoring:**

- Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to `Sha256Policy`
- Update `upload_bytes_async`, `upload_async`, `clean_file` to use `Vec<Sha256Policy>`
- Update all internal callers across `git_xet`, `xet_pkg`, migration tool, tests

## Motivation

`huggingface_hub` already knows whether SHA-256 is required. This change enables skipping expensive computation when unnecessary, or passing pre-computed hashes for bulk operations. Companion to #678.

---------

Co-authored-by: Wauplin <lucainp@gmail.com>
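A rough Python model of the policy conversion at the binding boundary, assuming the two parameters resolve to one policy per file. The `Policy` class and `resolve_policies` function are hypothetical stand-ins for the Rust `Sha256Policy` type, and the `"compute"` default is an assumption about what happens when neither parameter is set.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical stand-in for the Rust `Sha256Policy` enum."""
    kind: str  # "compute" (assumed default), "skip", or "provided"
    hex_digest: Optional[str] = None


def resolve_policies(
    n_files: int,
    skip_sha256: bool = False,
    sha256s: Optional[Sequence[str]] = None,
) -> List[Policy]:
    """Turn the two keyword parameters into one policy per file."""
    if skip_sha256 and sha256s is not None:
        raise ValueError("skip_sha256 and sha256s are mutually exclusive")
    if skip_sha256:
        return [Policy("skip")] * n_files
    if sha256s is not None:
        if len(sha256s) != n_files:
            raise ValueError("sha256s must have one entry per file")
        return [Policy("provided", h) for h in sha256s]
    # Assumption: with neither parameter set, hashes are computed internally.
    return [Policy("compute")] * n_files
```

Resolving everything up front means the internal upload functions only ever see a list of per-file policies, never the two raw flags.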
Summary
- Add an optional `sha256s` keyword parameter to the Python-exposed `upload_files()` function
- Forward it to `data_client::upload_async()`, which already supports it

Context
Double computation today
`huggingface_hub` computes SHA-256 on every file during `CommitOperationAdd.__post_init__()` for LFS batch negotiation, then `hf_xet` recomputes it internally because `upload_files()` doesn't accept pre-computed hashes.

Performance impact
This change eliminates the redundant computation entirely.
Backward compatibility
- `sha256s` is a keyword-only parameter with default `None`, so nothing changes for existing callers
- `data_client::upload_async()` already accepts `sha256s: Option<Vec<String>>` since day one
- `SingleFileCleaner` uses `ShaGenerator::ProvidedValue` and skips internal recomputation

Companion PR: huggingface/huggingface_hub#3876