feat: pass skip_sha256=True to hf_xet for bucket uploads #3900
Bucket uploads don't need SHA-256 in the shard metadata (the sha_index GSI is only used for LFS pointer resolution, which doesn't apply to buckets). Pass skip_sha256=True to hf_xet.upload_files() and upload_bytes() in the bucket upload path to skip the SHA-256 computation, removing the main CPU bottleneck on non-SHA-NI instances.
Depends on: huggingface/xet-core#679
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
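To get a rough feel for the hashing cost this change avoids, the sketch below times SHA-256 over an in-memory buffer. It uses Python's hashlib only as a proxy: hf_xet computes its hashes in compiled code, so absolute numbers will differ, and the helper name sha256_throughput_mib_s is made up for this illustration.

```python
import hashlib
import time


def sha256_throughput_mib_s(size_mib: int = 16) -> float:
    """Hash `size_mib` MiB of zero bytes and return the observed MiB/s.

    Illustrative helper only; it is not part of huggingface_hub or hf_xet.
    """
    payload = bytes(size_mib * 1024 * 1024)
    start = time.perf_counter()
    hashlib.sha256(payload).hexdigest()
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_mib / elapsed


throughput = sha256_throughput_mib_s(4)
```

Comparing the figure on a machine with and without SHA hardware extensions gives an idea of why the hash can dominate upload CPU time on the latter.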
Replace the two mock-based tests with a single integration test that:
- Creates a real Bucket on staging Hub
- Uploads files from both filepath and bytes in a single batch
- Wraps (not mocks) hf_xet.upload_files and hf_xet.upload_bytes to verify skip_sha256=True is passed
- Verifies files are actually uploaded by listing the bucket tree
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
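The difference between wrapping and mocking can be sketched with unittest.mock: passing wraps= to patch.object records each call while still running the real function. The fake_xet namespace, _upload_files, and upload_to_bucket below are hypothetical stand-ins, not the actual hf_xet or huggingface_hub signatures.

```python
import types
from unittest.mock import patch

uploaded = []


def _upload_files(paths, skip_sha256=False):
    """Hypothetical stand-in for an upload function; records what it 'uploads'."""
    uploaded.extend(paths)
    return len(paths)


# Hypothetical stand-in for the hf_xet module.
fake_xet = types.SimpleNamespace(upload_files=_upload_files)


def upload_to_bucket(paths):
    """Hypothetical stand-in for the bucket upload path under test."""
    return fake_xet.upload_files(paths, skip_sha256=True)


# wraps= keeps the real behaviour and records every call on the spy.
with patch.object(fake_xet, "upload_files", wraps=fake_xet.upload_files) as spy:
    result = upload_to_bucket(["a.bin", "b.bin"])

spy.assert_called_once()
assert spy.call_args.kwargs["skip_sha256"] is True
# The wrapped function really ran, so its side effect is observable.
assert uploaded == ["a.bin", "b.bin"]
```

With a plain mock the last assertion would fail, because nothing would have been uploaded. That is why the integration test can also list the bucket tree afterwards.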
Let's wait for next |
The test wraps the real hf_xet functions, so it fails when the installed hf_xet predates the skip_sha256 parameter (xet-core#679). Use inspect.signature to detect support and pytest.skip accordingly.
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
cc @rajatarya for viz |
hf_xet.upload_files is a compiled built-in function, so inspect.signature() raises ValueError. Catch it and skip the test when the signature can't be introspected (older hf_xet).
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
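A minimal sketch of this kind of detection follows, assuming a tri-state result so that "signature not introspectable" is kept distinct from "parameter absent". The helper accepts_kwarg and the two upload stand-ins are hypothetical names for illustration; the PR's actual test code is not shown here.

```python
import inspect
from typing import Callable, Optional


def accepts_kwarg(func: Callable, name: str) -> Optional[bool]:
    """Return True/False when the signature is readable, None when it is not."""
    try:
        params = inspect.signature(func).parameters
    except (ValueError, TypeError):
        # inspect.signature raises ValueError for callables that expose
        # no signature, which can happen with compiled functions.
        return None
    if name in params:
        return True
    # A **kwargs catch-all would also accept the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def new_upload(paths, skip_sha256=False):
    """Hypothetical newer function that knows the flag."""


def old_upload(paths):
    """Hypothetical older function that does not."""


supports_new = accepts_kwarg(new_upload, "skip_sha256")
supports_old = accepts_kwarg(old_upload, "skip_sha256")
```

In a test, a result other than True could be turned into a pytest.skip(...) call with a short reason string.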
Use try/except TypeError around upload_files/upload_bytes calls with skip_sha256=True, falling back to calls without it for older hf_xet versions. TypeError for unknown kwargs on compiled functions is raised before any I/O, so the fallback is safe.
Update test to check call_args_list[0] (the first attempt always includes skip_sha256=True) instead of requiring the function to accept it.
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
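The fallback described in this commit can be sketched as below, together with the call_args_list[0] check. The names old_upload and upload_with_fallback are hypothetical, and the stand-in is a plain Python function rather than a compiled one.

```python
from unittest.mock import Mock


def old_upload(paths):
    """Hypothetical stand-in for an older function without skip_sha256."""
    return f"uploaded {len(paths)} without flag"


def upload_with_fallback(upload, paths):
    """Try the new keyword first; retry without it if it is rejected."""
    try:
        return upload(paths, skip_sha256=True)
    except TypeError:
        # An unknown keyword is rejected while binding arguments,
        # before the function body runs, so nothing was uploaded yet.
        return upload(paths)


# Mock(wraps=...) records both attempts and still calls the real function.
spy = Mock(wraps=old_upload)
result = upload_with_fallback(spy, ["a.bin"])

first_attempt = spy.call_args_list[0]
assert first_attempt.kwargs == {"skip_sha256": True}
```

One caveat of this pattern: a TypeError raised from inside the upload itself would also trigger the retry, so the safety argument rests on the error coming from argument binding.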
Cursor Bugbot has reviewed your changes and found 2 potential issues.
hanouticelina left a comment
looks good to me! thank you
I've updated the logic to remove the try/except (let's always consider Cannot merge until |
Wauplin left a comment
hf-xet 1.4.2 got released with the fix => will merge as soon as CI is ✔️
This PR is orthogonal to #3876 (which passes sha256 to hf-xet to avoid recomputation on model/dataset upload).
cc @XciD
Note
Medium Risk
Touches the bucket upload path and changes parameters passed to hf_xet, which could affect upload metadata/compatibility if downstream expects SHA-256; scope is limited to buckets.
Overview
Bucket uploads via Xet now skip SHA-256 computation. The bucket upload flow in hf_api.py passes skip_sha256=True to hf_xet.upload_files and hf_xet.upload_bytes to avoid hashing overhead.
Adds an integration test (TestBucketXetUploadSkipSha256) that spies on hf_xet calls during batch_bucket_files to assert skip_sha256 is set for both file-path and in-memory byte uploads, and verifies the objects land in the bucket.
Written by Cursor Bugbot for commit 1fddd7e. This will update automatically on new commits.