feat: add skip_sha256 option to SingleFileCleaner#679
Conversation
rajatarya
left a comment
There was a problem hiding this comment.
Review Summary
Clean, well-scoped change. The ShaGenerator::Skip variant and Option<Sha256> return from finalize() are idiomatic. All existing callsites correctly pass false to preserve current behavior.
Key concerns:
- Contradictory parameter state —
sha256 = Some(...)+skip_sha256 = trueis silently accepted with no guard - API ergonomics — boolean param hurts readability at callsites; an enum collapsing both params would be cleaner
- Missing test coverage — no test exercises the
skip_sha256 = truepath - WASM parity —
hf_xet_wasm/src/wasm_file_cleaner.rsalways computes SHA-256 and always passesSome(metadata_ext)tofinalize(). If bucket uploads can go through WASM, this is a gap. If intentional, worth a comment in the code documenting the divergence. - Deployment ordering — companion xetcas#498 needs to land first or the server will reject shards without SHA-256
See inline comments for details.
|
Addressed all comments:
Out of curiosity — who actually uses the WASM upload path today? |
Not by any official clients, but some guy showed up recently and fixed our wasm build script -- apparently he was experimenting with it. |
data/src/lib.rs
Outdated
| mod file_cleaner; | ||
| mod file_download_session; | ||
| mod file_upload_session; | ||
| pub mod file_upload_session; |
There was a problem hiding this comment.
Instead of making file_upload_session be a pub mod, let's just add pub use Sha256Policy; here.
|
It's a weird abstraction that the ShaGenerator class actually handles the logic of whether or not it gets computed. I think we should leave it as it is and move all of that logic to the SingleFileCleaner class, making the Then move the Sha256Policy to file_cleaner.rs. |
Cross-Cutting Review: xet-core#679 + xetcas#498These two PRs form a paired change: the client (this PR) stops sending SHA-256 in shard metadata for bucket uploads, and the server (xetcas#498) makes 1. Deployment Ordering — Hard DependencyRisk: HIGH The server must be deployed before any client running this code. The current flow:
Recommendation: Both PRs should document the deployment order explicitly. Consider a feature flag or version negotiation so the client only skips SHA-256 when it knows the server supports it. At minimum, the rollout plan should be:
2. Wire Format Compatibility — Already SoundRisk: LOW The shard binary format is already compatible. 3. No End-to-End Integration TestRisk: MEDIUM Neither PR includes a test that exercises the full path: client creates a shard without SHA-256 → server parses it → DynamoDB item has The closest tests are:
Recommendation: At minimum, add a server-side test in xetcas that constructs a shard binary without 4. Semantic Gap in
|
| Concern | Risk | Mitigation |
|---|---|---|
| Deployment ordering | HIGH | Document order; consider feature flag |
| Wire format compat | LOW | Already supported by shard format |
| No E2E integration test | MEDIUM | Add server-side test with SHA-less shard |
| sizzle_sync empty string | MEDIUM | Make field Option<String> or document why "" is safe |
No caller passes true yet |
LOW | Note follow-up PR in description |
| WASM divergence | LOW-MEDIUM | Document; update WASM if needed |
6552b84 to
21661b4
Compare
Allow callers to skip SHA-256 computation entirely by passing skip_sha256=true. When skipped, no metadata_ext is included in the shard. This is useful for bucket uploads where the CAS sha_index is not queried and the SHA-256 overhead is significant. refactor: replace (sha256, skip_sha256) params with Sha256Policy enum Collapse the contradictory (sha256: Option<Sha256>, skip_sha256: bool) parameter pair into a single Sha256Policy enum with Compute, Provided, and Skip variants. This eliminates the invalid state at the type level and makes callsites self-documenting. Also adds test coverage for the ShaGenerator::Skip path and a comment documenting the WASM parity gap. refactor: move SHA-256 logic from ShaGenerator into SingleFileCleaner Remove the ShaGenerator enum and make the computation logic explicit in SingleFileCleaner using Option<Sha256Generator> + Option<Sha256>. Move Sha256Policy to file_cleaner.rs. test: add unit tests for Sha256Policy::Skip variant
21661b4 to
3652ea0
Compare
Bucket uploads don't need SHA-256 in the shard metadata (the sha_index GSI is only used for LFS pointer resolution, which doesn't apply to buckets). Pass skip_sha256=True to hf_xet.upload_files() and upload_bytes() in the bucket upload path to skip the SHA-256 computation, removing the main CPU bottleneck on non-SHA-NI instances. Depends on: huggingface/xet-core#679 Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
* feat: pass skip_sha256=True to hf_xet for bucket uploads Bucket uploads don't need SHA-256 in the shard metadata (the sha_index GSI is only used for LFS pointer resolution, which doesn't apply to buckets). Pass skip_sha256=True to hf_xet.upload_files() and upload_bytes() in the bucket upload path to skip the SHA-256 computation, removing the main CPU bottleneck on non-SHA-NI instances. Depends on: huggingface/xet-core#679 Co-authored-by: Lucain <Wauplin@users.noreply.github.com> * test: use real bucket upload instead of mocks for skip_sha256 test Replace the two mock-based tests with a single integration test that: - Creates a real Bucket on staging Hub - Uploads files from both filepath and bytes in a single batch - Wraps (not mocks) hf_xet.upload_files and hf_xet.upload_bytes to verify skip_sha256=True is passed - Verifies files are actually uploaded by listing the bucket tree Co-authored-by: Lucain <Wauplin@users.noreply.github.com> * test: skip skip_sha256 test when hf_xet doesn't support it yet The test wraps the real hf_xet functions, so it fails when the installed hf_xet predates the skip_sha256 parameter (xet-core#679). Use inspect.signature to detect support and pytest.skip accordingly. Co-authored-by: Lucain <Wauplin@users.noreply.github.com> * test: handle built-in functions in skip_sha256 signature check hf_xet.upload_files is a compiled built-in function, so inspect.signature() raises ValueError. Catch it and skip the test when the signature can't be introspected (older hf_xet). Co-authored-by: Lucain <Wauplin@users.noreply.github.com> * fix: gracefully fall back when hf_xet lacks skip_sha256 support Use try/except TypeError around upload_files/upload_bytes calls with skip_sha256=True, falling back to calls without it for older hf_xet versions. TypeError for unknown kwargs on compiled functions is raised before any I/O, so the fallback is safe. Update test to check call_args_list[0] (the first attempt always includes skip_sha256=True) instead of requiring the function to accept it. Co-authored-by: Lucain <Wauplin@users.noreply.github.com> * better like this --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
Summary
ShaGenerator::Skipvariant that skips SHA-256 computation entirelyShaGenerator::finalize()now returnsOption<Sha256>(None when skipped)SingleFileCleaner::new()andFileUploadSession::start_clean()accept askip_sha256booleanFileMetadataExtis included in the shardContext
Bucket uploads don't need SHA-256 in the shard metadata — the
sha_indexGSI is only used for LFS pointer resolution, which doesn't apply to buckets. Skipping SHA-256 for bucket uploads removes the main CPU bottleneck in the upload pipeline on non-SHA-NI instances.Alternative: dummy SHA-256
Instead of skipping entirely, the client could send a zeroed/dummy
FileMetadataExt. The server would still store it but queries would never match. This avoids the server-side schema change (xetcas PR) but pollutes the GSI with dummy entries.Companion PRs:
FileIdItem.sha256optional server-side)