Skip to content

feat: accept pre-computed SHA-256 in upload_files()#678

Merged
XciD merged 4 commits intomainfrom
feat/pass-sha256-to-upload
Mar 3, 2026
Merged

feat: accept pre-computed SHA-256 in upload_files()#678
XciD merged 4 commits intomainfrom
feat/pass-sha256-to-upload

Conversation

@XciD
Copy link
Member

@XciD XciD commented Mar 3, 2026

Summary

  • Add optional sha256s keyword parameter to the Python-exposed upload_files() function
  • Forward it to data_client::upload_async() which already supports it

Context

Double computation today

huggingface_hub computes SHA-256 on every file during CommitOperationAdd.__post_init__() for LFS batch negotiation, then hf_xet recomputes it internally because upload_files() doesn't accept pre-computed hashes.

Performance impact

This change eliminates the redundant computation entirely.

Backward compatibility

  • sha256s is a keyword-only parameter with default None — no change for existing callers
  • data_client::upload_async() already accepts sha256s: Option<Vec<String>> since day one
  • When provided, SingleFileCleaner uses ShaGenerator::ProvidedValue and skips internal recomputation

Companion PR: huggingface/huggingface_hub#3876

Add optional `sha256s` keyword parameter to the Python-exposed
`upload_files()` function and forward it to `data_client::upload_async()`.

When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue`
and skips internal SHA-256 recomputation — the mechanism already exists,
but the Python binding was hardcoding `None`.
@XciD XciD marked this pull request as ready for review March 3, 2026 05:20
@XciD XciD requested review from hoytak and seanses March 3, 2026 05:20
@Wauplin
Copy link
Collaborator

Wauplin commented Mar 3, 2026

Really nice finding! I reviewed the huggingface_hub related PR which looks good to me. We are planning a release by the end of the week so if it can be included in it it's good (otherwise no problem^^).

(orthogonal but #679 would also be nice to have for Buckets upload 😃 )

@XciD XciD changed the title Accept pre-computed SHA-256 in upload_files() feat: accept pre-computed SHA-256 in upload_files() Mar 3, 2026
Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if we have some unit-tests for this case, but otherwise looks good.

We should have some validation when provided that there is a 1:1 match in lengths between file_paths and sha256s.

Copy link
Collaborator

@hoytak hoytak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Add early validation in upload_files() to reject mismatched sha256s/file_paths
lengths with a clear Python error message. Add unit tests for the validation.

Addresses review feedback from #678.
@XciD
Copy link
Member Author

XciD commented Mar 3, 2026

It would be good if we have some unit-tests for this case, but otherwise looks good.

We should have some validation when provided that there is a 1:1 match in lengths between file_paths and sha256s.

Yeah you right, catching the array length is easy check. Also added a test.
testing that sha is not compute is not trivial

@XciD XciD merged commit 40b45fb into main Mar 3, 2026
7 checks passed
@XciD XciD deleted the feat/pass-sha256-to-upload branch March 3, 2026 20:20
XciD added a commit that referenced this pull request Mar 12, 2026
## Summary

Add `skip_sha256` and `sha256s` parameters to `upload_bytes()` Python
binding for per-file SHA-256 policies:
- `skip_sha256: bool = False` - Skip SHA-256 computation entirely (sets
`Sha256Policy::Skip`)
- `sha256s: Optional[List[str]] = None` - Provide pre-computed SHA-256
hashes (companion to existing parameter on `upload_files()`)
- These parameters are mutually exclusive

## Changes

**Python binding changes:**
- Add `skip_sha256` + `sha256s` params to `upload_bytes()` /
`upload_files()`
- All policy conversion happens at Python boundary

**Internal refactoring:**
- Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to
`Sha256Policy`
- Update `upload_bytes_async`, `upload_async`, `clean_file` to use
`Vec<Sha256Policy>`
- Update all internal callers across `git_xet`, `xet_pkg`, migration
tool, tests

## Motivation

`huggingface_hub` already knows whether SHA-256 is required. This change
enables skipping expensive computation when unnecessary, or passing
pre-computed hashes for bulk operations.

Companion to #678.

---------

Co-authored-by: Wauplin <lucainp@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants