
feat: pass pre-computed SHA-256 to hf_xet upload #3876

Merged
Wauplin merged 7 commits into main from feat/pass-sha256-to-xet on Mar 13, 2026
Conversation

@XciD
Member

@XciD XciD commented Mar 3, 2026

Summary

Pass the SHA-256 hashes already computed during CommitOperationAdd.__post_init__() (via UploadInfo.from_path()) to hf_xet.upload_files() via the new sha256s keyword parameter.

Context

Double computation today

For repo commits, huggingface_hub computes SHA-256 on every file for LFS batch negotiation, then hf_xet recomputes it internally because upload_files() doesn't accept pre-computed hashes:

CommitOperationAdd.__post_init__()
  → UploadInfo.from_path()
    → sha_fileobj()          ← SHA-256 #1

upload_files(paths, ...)     ← sha256s not passed
  → SingleFileCleaner
    → ShaGenerator::Generate ← SHA-256 #2 (same bytes, same result)
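
The call flow above can be illustrated with a minimal sketch. `sha256_stream` is a hypothetical stand-in for a chunked file hasher, not the actual `sha_fileobj` or `ShaGenerator` implementation; it only shows that hashing the same bytes twice yields the same digest, so the second pass adds cost without adding information.

```python
import hashlib
import io


def sha256_stream(fileobj, chunk_size=1024 * 1024):
    """Hash a binary file object in fixed-size chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.digest()


payload = b"model weights " * 100_000

# Pass #1: roughly what the commit-preparation step computes.
first = sha256_stream(io.BytesIO(payload))
# Pass #2: roughly what the uploader recomputes over the same bytes.
second = sha256_stream(io.BytesIO(payload))

# Identical input gives an identical digest, so pass #2 is pure overhead.
assert first == second
```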

Performance impact

On instances without SHA-NI (e.g. AWS m5.xlarge), SHA-256 runs at ~280-310 MB/s in software and accounts for 70-80% of the upload pipeline CPU time. Passing the pre-computed hash eliminates the redundant second pass.
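
To get a rough local figure for comparison, a small benchmark along these lines could be used. This is only a sketch: `sha256_throughput_mb_s` is a hypothetical helper, it measures Python's `hashlib` rather than the hasher used inside hf_xet, and whether hardware SHA extensions are used depends on the Python build and its crypto backend.

```python
import hashlib
import time


def sha256_throughput_mb_s(total_mb=64, chunk_mb=1):
    """Return an approximate SHA-256 throughput in MB/s for in-memory data."""
    chunk = bytes(chunk_mb * 1024 * 1024)  # zero-filled buffer, reused for every update
    digest = hashlib.sha256()
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        digest.update(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_mb / elapsed


if __name__ == "__main__":
    print(f"SHA-256 throughput: {sha256_throughput_mb_s():.0f} MB/s")
```

Because the data stays in memory, the number reflects hashing cost only and not disk or network time.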

Scope

Only the repo commit path (_upload_xet_files in _commit_api.py) is changed, where UploadInfo.sha256 is already available.

The bucket path (hf_api.py:_batch_bucket_files) does not compute SHA-256 upfront, so it is not changed here.

Depends on: huggingface/xet-core#678


Note

Medium Risk
Changes the Xet upload code path and bumps the hf-xet dependency, so incompatibilities or mismatched hash ordering could impact uploads if the new API behaves unexpectedly.

Overview
Xet-backed commit uploads now forward precomputed SHA-256 hashes to hf_xet.upload_files()/upload_bytes() via a new sha256s argument, avoiding redundant hashing during _upload_xet_files.

Also bumps the hf-xet dependency to >=1.4.2 and extends test_xet_upload to assert the sha256s values are passed through alongside the existing header filtering.
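
The kind of assertion described for `test_xet_upload` might look like the sketch below. It does not reproduce the real test: `upload_with_hashes` is a hypothetical wrapper and the uploader is replaced by a `MagicMock`, so only the pass-through of the `sha256s` keyword is checked.

```python
from unittest.mock import MagicMock


def upload_with_hashes(xet_upload, paths, sha256s):
    """Hypothetical wrapper: forward paths and precomputed hashes to an uploader."""
    xet_upload(paths, sha256s=sha256s)


mock_upload = MagicMock()
paths = ["a.bin", "b.bin"]
hashes = ["00" * 32, "ff" * 32]  # placeholder 64-character hex strings

upload_with_hashes(mock_upload, paths, hashes)

# The uploader must be called exactly once, with the hashes passed by keyword.
mock_upload.assert_called_once()
_, kwargs = mock_upload.call_args
assert kwargs["sha256s"] == hashes
```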

Written by Cursor Bugbot for commit d799e2d.

@bot-ci-comment

bot-ci-comment bot commented Mar 3, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

Copilot AI left a comment


Pull request overview

This PR eliminates a redundant SHA-256 computation in the Xet repo-commit upload path. During a commit, CommitOperationAdd.__post_init__() already computes a SHA-256 hash for every file (via UploadInfo.from_path()). Previously, hf_xet.upload_files() would compute that same hash a second time internally. This PR passes the already-computed hashes to upload_files() via its new sha256s keyword parameter, halving SHA-256 work on the repo-commit path.

Changes:

  • Builds all_sha256s as a list of hex-encoded SHA-256 strings derived from op.upload_info.sha256 for all path-based upload operations.
  • Passes sha256s=all_sha256s as a new keyword argument to hf_xet.upload_files().
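
The two changes listed above can be sketched as follows. `FakeUploadInfo`, `FakeAddOperation`, and `collect_paths_and_hashes` are hypothetical stand-ins rather than the real huggingface_hub classes, and the sketch assumes the digest is stored as raw bytes and converted with `.hex()`. Building both lists in the same loop keeps each hash at the same index as its path, which is the ordering concern raised in the risk note.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FakeUploadInfo:
    sha256: bytes  # assumed: raw 32-byte digest
    size: int


@dataclass
class FakeAddOperation:
    path: str
    upload_info: FakeUploadInfo


def collect_paths_and_hashes(operations):
    """Build parallel lists so that all_sha256s[i] always belongs to all_paths[i]."""
    all_paths = []
    all_sha256s = []
    for op in operations:
        all_paths.append(op.path)
        all_sha256s.append(op.upload_info.sha256.hex())  # hex-encode for the uploader
    return all_paths, all_sha256s


files = [("a.bin", b"alpha"), ("b.bin", b"beta")]
ops = [
    FakeAddOperation(name, FakeUploadInfo(hashlib.sha256(data).digest(), len(data)))
    for name, data in files
]
paths, sha256s = collect_paths_and_hashes(ops)
# A real caller would then pass something like: upload_files(paths, ..., sha256s=sha256s)
```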


Contributor

@Wauplin Wauplin left a comment


Very nice finding! Looks good. Let's just wait for huggingface/xet-core#678 to be merged and shipped so we can bump hf_xet in the dependencies.

@XciD changed the title from "Pass pre-computed SHA-256 to hf_xet upload" to "feat: pass pre-computed SHA-256 to hf_xet upload" on Mar 3, 2026
XciD added a commit to huggingface/xet-core that referenced this pull request on Mar 3, 2026:
## Summary

- Add optional `sha256s` keyword parameter to the Python-exposed
`upload_files()` function
- Forward it to `data_client::upload_async()` which already supports it

## Context

### Double computation today

`huggingface_hub` computes SHA-256 on every file during
`CommitOperationAdd.__post_init__()` for LFS batch negotiation, then
`hf_xet` recomputes it internally because `upload_files()` doesn't
accept pre-computed hashes.

### Performance impact

This change eliminates the redundant computation entirely.

### Backward compatibility

- `sha256s` is a keyword-only parameter with default `None` — no change
for existing callers
- `data_client::upload_async()` already accepts `sha256s:
Option<Vec<String>>` since day one
- When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue`
and skips internal recomputation

Companion PR: huggingface/huggingface_hub#3876
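
The backward-compatibility behaviour described in the commit message above can be modelled in Python as a rough analogue. `upload_blobs_sketch` is hypothetical and is not the hf_xet API; it takes in-memory bytes instead of file paths, and the length check is an assumption added for illustration. It shows a keyword-only `sha256s` defaulting to `None`, with a "generate" branch for existing callers and a "provided value" branch that skips hashing.

```python
import hashlib


def upload_blobs_sketch(blobs, *, sha256s=None):
    """Return the hex digest associated with each blob (hypothetical analogue)."""
    if sha256s is not None and len(sha256s) != len(blobs):
        # Assumed validation: one hash per blob, in the same order.
        raise ValueError("sha256s must have one entry per blob")
    used = []
    for index, blob in enumerate(blobs):
        if sha256s is None:
            # "Generate" branch: existing callers still get a computed hash.
            used.append(hashlib.sha256(blob).hexdigest())
        else:
            # "Provided value" branch: trust the caller and skip recomputation.
            used.append(sha256s[index])
    return used


blobs = [b"alpha", b"beta"]
precomputed = [hashlib.sha256(b).hexdigest() for b in blobs]

# Existing call style and new call style yield the same digests.
assert upload_blobs_sketch(blobs) == upload_blobs_sketch(blobs, sha256s=precomputed)
```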
Contributor

@Wauplin Wauplin left a comment


All good now that https://github.com/huggingface/xet-core/releases/tag/v1.4.0 is shipped! I've bumped the minimal version

Contributor

@Wauplin Wauplin left a comment


Let's wait for a new hf-xet patch since upload_bytes doesn't have the sha256s parameter yet

Contributor

@Wauplin Wauplin left a comment


Waiting for CI and we should be good to merge

@Wauplin Wauplin merged commit 141fcfd into main Mar 13, 2026
21 of 22 checks passed
@Wauplin Wauplin deleted the feat/pass-sha256-to-xet branch March 13, 2026 08:32