Skip to content

feat: add skip_sha256 option to SingleFileCleaner#679

Merged
XciD merged 1 commit intomainfrom
feat/optional-sha256
Mar 10, 2026
Merged

feat: add skip_sha256 option to SingleFileCleaner#679
XciD merged 1 commit intomainfrom
feat/optional-sha256

Conversation

@XciD
Copy link
Member

@XciD XciD commented Mar 3, 2026

Summary

  • Add ShaGenerator::Skip variant that skips SHA-256 computation entirely
  • ShaGenerator::finalize() now returns Option<Sha256> (None when skipped)
  • SingleFileCleaner::new() and FileUploadSession::start_clean() accept a skip_sha256 boolean
  • When skipped, no FileMetadataExt is included in the shard

Context

Bucket uploads don't need SHA-256 in the shard metadata — the sha_index GSI is only used for LFS pointer resolution, which doesn't apply to buckets. Skipping SHA-256 for bucket uploads removes the main CPU bottleneck in the upload pipeline on non-SHA-NI instances.

Alternative: dummy SHA-256

Instead of skipping entirely, the client could send a zeroed/dummy FileMetadataExt. The server would still store it but queries would never match. This avoids the server-side schema change (xetcas PR) but pollutes the GSI with dummy entries.

Companion PRs:

  • xetcas: huggingface-internal/xetcas#498 (make FileIdItem.sha256 optional server-side)

@XciD XciD marked this pull request as ready for review March 3, 2026 06:22
@XciD XciD requested review from hoytak, rajatarya and seanses March 3, 2026 06:22
@XciD XciD changed the title Add skip_sha256 option to SingleFileCleaner feat: add skip_sha256 option to SingleFileCleaner Mar 3, 2026
Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Clean, well-scoped change. The ShaGenerator::Skip variant and Option<Sha256> return from finalize() are idiomatic. All existing callsites correctly pass false to preserve current behavior.

Key concerns:

  1. Contradictory parameter statesha256 = Some(...) + skip_sha256 = true is silently accepted with no guard
  2. API ergonomics — boolean param hurts readability at callsites; an enum collapsing both params would be cleaner
  3. Missing test coverage — no test exercises the skip_sha256 = true path
  4. WASM parityhf_xet_wasm/src/wasm_file_cleaner.rs always computes SHA-256 and always passes Some(metadata_ext) to finalize(). If bucket uploads can go through WASM, this is a gap. If intentional, worth a comment in the code documenting the divergence.
  5. Deployment ordering — companion xetcas#498 needs to land first or the server will reject shards without SHA-256

See inline comments for details.

@XciD
Copy link
Member Author

XciD commented Mar 3, 2026

Addressed all comments:

  • Replaced (sha256: Option<Sha256>, skip_sha256: bool) with a Sha256Policy enum (Compute, Provided, Skip)
  • Added test for ShaGenerator::Skip
  • Added a comment documenting the WASM divergence

Out of curiosity — who actually uses the WASM upload path today?

@seanses
Copy link
Collaborator

seanses commented Mar 3, 2026

Out of curiosity — who actually uses the WASM upload path today?

Not by any official clients, but some guy showed up recently and fixed our wasm build script -- apparently he was experimenting with it.

data/src/lib.rs Outdated
mod file_cleaner;
mod file_download_session;
mod file_upload_session;
pub mod file_upload_session;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making file_upload_session be a pub mod, let's just add pub use Sha256Policy; here.

@hoytak
Copy link
Collaborator

hoytak commented Mar 3, 2026

It's a weird abstraction that the ShaGenerator class actually handles the logic of whether or not it gets computed. I think we should leave it as it is and move all of that logic to the SingleFileCleaner class, making the sha_generator : ShaGenerator there be sha_generator : Option<ShaGenerator> with provided_sha : Option<...> and making the computation logic around what happens explicit. It'll just mean adding a few if let Some(...) but it'll make it cleaner and more readable.

Then move the Sha256Policy to file_cleaner.rs.

@rajatarya
Copy link
Collaborator

Cross-Cutting Review: xet-core#679 + xetcas#498

These two PRs form a paired change: the client (this PR) stops sending SHA-256 in shard metadata for bucket uploads, and the server (xetcas#498) makes FileIdItem.sha256 optional so it can accept those shards. Here's an analysis of the concerns that span both PRs.


1. Deployment Ordering — Hard Dependency

Risk: HIGH

The server must be deployed before any client running this code. The current flow:

  • Client (this PR): When skip_sha256 = true, FileMetadataExt is omitted from the shard binary. The FileDataSequenceHeader.contains_metadata_ext() flag is false, so the shard doesn't contain the SHA section at all.
  • Server (xetcas#498): read_file_info_ranges() already returns Option<MerkleHash> — it respects the contains_metadata_ext() flag. However, the old server code at database.rs:294-296 treats a missing SHA as a 400 error:
    let sha256 = sha.ok_or_else(|| {
        CasError::bad_request(format!("invalid shard: sha for file: {file_id:x} is missing"))
    })?;
    So a new client sending SHA-less shards to the old server will get rejected.

Recommendation: Both PRs should document the deployment order explicitly. Consider a feature flag or version negotiation so the client only skips SHA-256 when it knows the server supports it. At minimum, the rollout plan should be:

  1. Deploy xetcas#498 to all server instances
  2. Only then release a client version with this PR

2. Wire Format Compatibility — Already Sound

Risk: LOW

The shard binary format is already compatible. FileDataSequenceHeader has a contains_metadata_ext() bit flag that controls whether the FileMetadataExt section is present. The server's read_file_info_ranges() (in mdb_shard/src/shard_format.rs:917) already checks this flag and returns None when the section is absent. This is not a new format — it's an existing capability that was previously unused because the client always sent metadata.


3. No End-to-End Integration Test

Risk: MEDIUM

Neither PR includes a test that exercises the full path: client creates a shard without SHA-256 → server parses it → DynamoDB item has sha256: None → GSI doesn't index it → LFS pointer resolution still works for other files.

The closest tests are:

  • xet-core#679: No test with skip_sha256 = true at all
  • xetcas#498: Tests update existing Some(sha256) paths but don't test sha256: None items against the GSI

Recommendation: At minimum, add a server-side test in xetcas that constructs a shard binary without FileMetadataExt, calls sync_shard_impl, and verifies the resulting FileIdItem has sha256: None and doesn't appear in GSI queries.


4. Semantic Gap in sizzle_sync

Risk: MEDIUM

In xetcas#498, sizzle_sync/src/commands/dynamodb_sync.rs handles missing SHA-256 with unwrap_or_default(), converting None to "". But the client-side never sends an empty string — it either sends a valid SHA-256 hex string or omits the field entirely. The empty string "" is a third state that doesn't exist in the source data.

If sizzle_sync downstream code does string comparisons, joins, or lookups using the sha256 field, "" will behave differently than a proper None/absent value. This is a cross-repo semantic leak.


5. No Caller of skip_sha256 = true in This PR

Risk: LOW (intentional)

This PR adds the skip_sha256 parameter but every callsite passes false. The actual bucket-upload integration that passes true is presumably in a follow-up PR. This is fine for incremental merging, but means the feature is untestable in production until that follow-up lands. Consider noting the follow-up PR in the description so reviewers know where the feature gets activated.


6. WASM Client Divergence

Risk: LOW-MEDIUM

The WASM client (hf_xet_wasm/src/wasm_file_cleaner.rs) has its own SingleFileCleaner that always computes SHA-256 and always passes Some(metadata_ext) to finalize(). If bucket uploads ever go through the WASM path, they'll still compute SHA-256 unnecessarily. The two SingleFileCleaner implementations are diverging — the native one now has skip_sha256 but the WASM one doesn't. This should be documented, and if WASM bucket uploads are planned, the WASM cleaner should be updated in a follow-up.


Summary Table

Concern Risk Mitigation
Deployment ordering HIGH Document order; consider feature flag
Wire format compat LOW Already supported by shard format
No E2E integration test MEDIUM Add server-side test with SHA-less shard
sizzle_sync empty string MEDIUM Make field Option<String> or document why "" is safe
No caller passes true yet LOW Note follow-up PR in description
WASM divergence LOW-MEDIUM Document; update WASM if needed

@XciD XciD force-pushed the feat/optional-sha256 branch 9 times, most recently from 6552b84 to 21661b4 Compare March 4, 2026 08:05
Allow callers to skip SHA-256 computation entirely by passing
skip_sha256=true. When skipped, no metadata_ext is included in the
shard. This is useful for bucket uploads where the CAS sha_index
is not queried and the SHA-256 overhead is significant.

refactor: replace (sha256, skip_sha256) params with Sha256Policy enum

Collapse the contradictory (sha256: Option<Sha256>, skip_sha256: bool)
parameter pair into a single Sha256Policy enum with Compute, Provided,
and Skip variants. This eliminates the invalid state at the type level
and makes callsites self-documenting.

Also adds test coverage for the ShaGenerator::Skip path and a comment
documenting the WASM parity gap.

refactor: move SHA-256 logic from ShaGenerator into SingleFileCleaner

Remove the ShaGenerator enum and make the computation logic explicit
in SingleFileCleaner using Option<Sha256Generator> + Option<Sha256>.
Move Sha256Policy to file_cleaner.rs.

test: add unit tests for Sha256Policy::Skip variant
@XciD XciD force-pushed the feat/optional-sha256 branch from 21661b4 to 3652ea0 Compare March 8, 2026 12:27
Copy link
Collaborator

@seanses seanses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@XciD XciD requested a review from hoytak March 9, 2026 07:04
cursor bot pushed a commit to huggingface/huggingface_hub that referenced this pull request Mar 10, 2026
Bucket uploads don't need SHA-256 in the shard metadata (the sha_index
GSI is only used for LFS pointer resolution, which doesn't apply to
buckets). Pass skip_sha256=True to hf_xet.upload_files() and
upload_bytes() in the bucket upload path to skip the SHA-256
computation, removing the main CPU bottleneck on non-SHA-NI instances.

Depends on: huggingface/xet-core#679

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
Copy link
Collaborator

@hoytak hoytak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@XciD XciD merged commit a48f1f8 into main Mar 10, 2026
7 checks passed
@XciD XciD deleted the feat/optional-sha256 branch March 10, 2026 16:36
Wauplin added a commit to huggingface/huggingface_hub that referenced this pull request Mar 13, 2026
* feat: pass skip_sha256=True to hf_xet for bucket uploads

Bucket uploads don't need SHA-256 in the shard metadata (the sha_index
GSI is only used for LFS pointer resolution, which doesn't apply to
buckets). Pass skip_sha256=True to hf_xet.upload_files() and
upload_bytes() in the bucket upload path to skip the SHA-256
computation, removing the main CPU bottleneck on non-SHA-NI instances.

Depends on: huggingface/xet-core#679

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>

* test: use real bucket upload instead of mocks for skip_sha256 test

Replace the two mock-based tests with a single integration test that:
- Creates a real Bucket on staging Hub
- Uploads files from both filepath and bytes in a single batch
- Wraps (not mocks) hf_xet.upload_files and hf_xet.upload_bytes to
  verify skip_sha256=True is passed
- Verifies files are actually uploaded by listing the bucket tree

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>

* test: skip skip_sha256 test when hf_xet doesn't support it yet

The test wraps the real hf_xet functions, so it fails when the
installed hf_xet predates the skip_sha256 parameter (xet-core#679).
Use inspect.signature to detect support and pytest.skip accordingly.

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>

* test: handle built-in functions in skip_sha256 signature check

hf_xet.upload_files is a compiled built-in function, so
inspect.signature() raises ValueError. Catch it and skip the test
when the signature can't be introspected (older hf_xet).

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>

* fix: gracefully fall back when hf_xet lacks skip_sha256 support

Use try/except TypeError around upload_files/upload_bytes calls with
skip_sha256=True, falling back to calls without it for older hf_xet
versions. TypeError for unknown kwargs on compiled functions is raised
before any I/O, so the fallback is safe.

Update test to check call_args_list[0] (the first attempt always
includes skip_sha256=True) instead of requiring the function to
accept it.

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>

* better like this

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants