
Updated chunker to eliminate spurious boundary triggering. #487

Merged: hoytak merged 4 commits into main from hoytak/250909-min-chunk-correctness on Sep 10, 2025
Conversation

@hoytak (Collaborator) commented Sep 9, 2025:

We must enforce that the next boundary is actually past the minimum chunk size. In rare cases, a boundary could be triggered before the minimum chunk size has passed, and this triggering would be based not on the content of the file but on the previous state of the rolling hash function. This PR fixes this case.
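A minimal sketch of the idea behind the fix, with illustrative names (`MIN_CHUNK`, `MAX_CHUNK`, `next_boundary`) and a toy hash update rather than xet-core's actual gear rolling hash: the boundary condition is simply never consulted until the minimum chunk size has passed, so hash state carried over from earlier data can never cut a short chunk.

```rust
// Toy sketch, not xet-core's real implementation: sizes, mask, and the
// hash update are all illustrative.
const MIN_CHUNK: usize = 8 * 1024;
const MAX_CHUNK: usize = 128 * 1024;
const MASK: u64 = 0xFFFF; // a boundary fires when hash & MASK == 0

/// Returns the length of the next chunk in `data`. `hash` carries rolling
/// state between calls, which is exactly why the MIN_CHUNK guard matters:
/// without it, stale state could trigger a boundary on the first bytes.
fn next_boundary(data: &[u8], hash: &mut u64) -> usize {
    for (i, &b) in data.iter().enumerate() {
        // Toy rolling hash update; the real chunker uses a gear hash table.
        *hash = hash.wrapping_shl(1).wrapping_add(b as u64);
        let len = i + 1;
        // The fix: a boundary is only valid once the minimum chunk size
        // has passed, regardless of the inherited hash state.
        if len >= MIN_CHUNK && (*hash & MASK) == 0 {
            return len;
        }
        if len >= MAX_CHUNK {
            return len;
        }
    }
    data.len()
}
```

Note that even if the inherited `hash` already satisfies the boundary mask on entry, the earliest possible cut is at `MIN_CHUNK`, which is the correctness property this PR enforces.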

@seanses (Collaborator) left a comment:

Bug

@assafvayner (Contributor) left a comment:

LGTM

@assafvayner requested a review from @seanses on September 10, 2025 01:03
  // If we have a lot of data, don't read all the way to the end when we'll stop reading
  // at the maximum chunk boundary.
- let read_end = n_bytes.min(consume_len + self.maximum_chunk - cur_chunk_len);
+ let read_end = n_bytes.min(cur_index + self.maximum_chunk - previous_len);
Collaborator:

The value cur_index + self.maximum_chunk - previous_len doesn't seem to be identical to the previous value consume_len + self.maximum_chunk - cur_chunk_len. This result seems to go past the max chunk size boundary.

Collaborator:

But I guess this doesn't affect the logic below that cuts at the max chunk size.

Collaborator (Author):

It should be the same -- those variables were mainly renamed.

Collaborator:

See the if statement above: previously we had both consume_len += max_advance and cur_chunk_len += max_advance, but now only cur_index += skip.

@hoytak (Collaborator, Author) commented Sep 10, 2025:

Ah, true... I think that was actually an error too, and one that would cause even more spurious results: when there is data in the buffer we would start chunking earlier, allowing even smaller chunks.
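The invariant the reviewers are circling can be sketched in isolation (variable names follow the old diff; the surrounding chunker logic is omitted): if the consumed-position and chunk-length counters track the same stream, a skip must advance both, or the derived "how far may we read before hitting the max chunk" window silently grows.

```rust
// Toy model of the counter-synchronization point, not xet-core's code.
struct Counters {
    consume_len: usize,   // position consumed in the input
    cur_chunk_len: usize, // length of the chunk being built
}

impl Counters {
    // Correct: a skip advances both counters together, as the old code did
    // with `consume_len += max_advance; cur_chunk_len += max_advance;`.
    fn skip(&mut self, n: usize) {
        self.consume_len += n;
        self.cur_chunk_len += n;
    }

    // Index at which reading must stop so the chunk cannot exceed
    // maximum_chunk bytes.
    fn read_end(&self, n_bytes: usize, maximum_chunk: usize) -> usize {
        n_bytes.min(self.consume_len + maximum_chunk - self.cur_chunk_len)
    }
}
```

If only the position counter is advanced (the bug seanses points out), `read_end` extends past where the max-size cut should happen, e.g. skipping 100 bytes into a 1024-byte max window yields an end of 1124 instead of 1024.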

@seanses (Collaborator) left a comment:

Optionally, also explicitly reset the hash when is_final is true, and document on the finish() function that the chunker can be reused.
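A sketch of what that suggestion could look like (the struct, fields, and finish() signature here are illustrative, not xet-core's actual API): finish() flushes any trailing data and resets the rolling-hash state so a subsequent stream starts clean.

```rust
// Illustrative chunker shell; only the reset-on-finish behavior is the point.
struct Chunker {
    hash: u64,    // rolling hash state carried between calls
    buf: Vec<u8>, // bytes of the chunk currently being built
}

impl Chunker {
    /// Emits the final (possibly short) chunk and resets internal state,
    /// leaving the chunker ready to be reused on a new stream.
    fn finish(&mut self) -> Option<Vec<u8>> {
        // Explicit reset: stale hash state must not leak into the next stream.
        self.hash = 0;
        if self.buf.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buf))
        }
    }
}
```

With the minimum-chunk-size guard from this PR, the reset is arguably belt-and-suspenders, but it makes the reuse contract of finish() explicit.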

@seanses (Collaborator) left a comment:

This leads to some dedup loss against existing Xet data. But since we expect more data in the future and want to conform to the OSS spec, this is fine.

From experiments against existing models (measured with the xtool dedup command), for some model files we will see 13% loss:

[screenshot: dedup loss]

@hoytak merged commit fe71e3d into main on Sep 10, 2025
6 checks passed
@hoytak deleted the hoytak/250909-min-chunk-correctness branch on September 10, 2025 23:52
coyotte508 pushed a commit to huggingface/huggingface.js that referenced this pull request Sep 11, 2025
updates new wasm code to include xet wasm chunker fix from
huggingface/xet-core#487

wasm bin size 99K
