
Updated chunker to eliminate spurious boundary triggering. #487

Merged: hoytak merged 4 commits into main from hoytak/250909-min-chunk-correctness on Sep 10, 2025
Conversation

@hoytak (Collaborator) commented Sep 9, 2025:

We must enforce that the next boundary is actually past the minimum chunk size. In rare cases, a boundary could be triggered before the minimum chunk size has passed, and this triggering would be based not on the content of the file but on the previous state of the rolling hash function. This PR fixes this case.
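A minimal sketch of the idea behind the fix, with illustrative names (`MIN_CHUNK`, `MAX_CHUNK`, `next_boundary`) and a toy hash update rather than xet-core's actual gear rolling hash: the boundary condition is simply never consulted until the minimum chunk size has passed, so hash state carried over from earlier data can never cut a short chunk.

```rust
// Toy sketch, not xet-core's real implementation: sizes, mask, and the
// hash update are all illustrative.
const MIN_CHUNK: usize = 8 * 1024;
const MAX_CHUNK: usize = 128 * 1024;
const MASK: u64 = 0xFFFF; // a boundary fires when hash & MASK == 0

/// Returns the length of the next chunk in `data`. `hash` carries rolling
/// state between calls, which is exactly why the MIN_CHUNK guard matters:
/// without it, stale state could trigger a boundary on the first bytes.
fn next_boundary(data: &[u8], hash: &mut u64) -> usize {
    for (i, &b) in data.iter().enumerate() {
        // Toy rolling hash update; the real chunker uses a gear hash table.
        *hash = hash.wrapping_shl(1).wrapping_add(b as u64);
        let len = i + 1;
        // The fix: a boundary is only valid once the minimum chunk size
        // has passed, regardless of the inherited hash state.
        if len >= MIN_CHUNK && (*hash & MASK) == 0 {
            return len;
        }
        if len >= MAX_CHUNK {
            return len;
        }
    }
    data.len()
}
```

Note that even if the inherited `hash` already satisfies the boundary mask on entry, the earliest possible cut is at `MIN_CHUNK`, which is the correctness property this PR enforces.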

@seanses (Collaborator) left a comment:

Bug

@assafvayner (Contributor) left a comment:

LGTM

@assafvayner requested a review from @seanses on September 10, 2025 01:03
  // If we have a lot of data, don't read all the way to the end when we'll stop reading
  // at the maximum chunk boundary.
- let read_end = n_bytes.min(consume_len + self.maximum_chunk - cur_chunk_len);
+ let read_end = n_bytes.min(cur_index + self.maximum_chunk - previous_len);
Collaborator:

The value cur_index + self.maximum_chunk - previous_len doesn't seem to be identical to the previous value consume_len + self.maximum_chunk - cur_chunk_len. This result seems to go past the max chunk size boundary.

Collaborator:

But I guess this doesn't affect the logic below that cuts at the max chunk size.

Collaborator (Author):

It should be the same -- those variables were mainly renamed.

Collaborator:

See the if statement above: previously we had both consume_len += max_advance and cur_chunk_len += max_advance, but now only cur_index += skip.

@hoytak (Collaborator, Author) commented Sep 10, 2025:

Ah, true... I think that was actually an error too, and one that would cause even more spurious results: when there is data in the buffer we would start chunking earlier, allowing even smaller chunks.
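The invariant the reviewers are circling can be sketched in isolation (variable names follow the old diff; the surrounding chunker logic is omitted): if the consumed-position and chunk-length counters track the same stream, a skip must advance both, or the derived "how far may we read before hitting the max chunk" window silently grows.

```rust
// Toy model of the counter-synchronization point, not xet-core's code.
struct Counters {
    consume_len: usize,   // position consumed in the input
    cur_chunk_len: usize, // length of the chunk being built
}

impl Counters {
    // Correct: a skip advances both counters together, as the old code did
    // with `consume_len += max_advance; cur_chunk_len += max_advance;`.
    fn skip(&mut self, n: usize) {
        self.consume_len += n;
        self.cur_chunk_len += n;
    }

    // Index at which reading must stop so the chunk cannot exceed
    // maximum_chunk bytes.
    fn read_end(&self, n_bytes: usize, maximum_chunk: usize) -> usize {
        n_bytes.min(self.consume_len + maximum_chunk - self.cur_chunk_len)
    }
}
```

If only the position counter is advanced (the bug seanses points out), `read_end` extends past where the max-size cut should happen, e.g. skipping 100 bytes into a 1024-byte max window yields an end of 1124 instead of 1024.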

@seanses (Collaborator) left a comment:

Optionally, also explicitly reset the hash when is_final is true, and document on the finish() function that the chunker can be reused.
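A sketch of what that suggestion could look like (the struct, fields, and finish() signature here are illustrative, not xet-core's actual API): finish() flushes any trailing data and resets the rolling-hash state so a subsequent stream starts clean.

```rust
// Illustrative chunker shell; only the reset-on-finish behavior is the point.
struct Chunker {
    hash: u64,    // rolling hash state carried between calls
    buf: Vec<u8>, // bytes of the chunk currently being built
}

impl Chunker {
    /// Emits the final (possibly short) chunk and resets internal state,
    /// leaving the chunker ready to be reused on a new stream.
    fn finish(&mut self) -> Option<Vec<u8>> {
        // Explicit reset: stale hash state must not leak into the next stream.
        self.hash = 0;
        if self.buf.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buf))
        }
    }
}
```

With the minimum-chunk-size guard from this PR, the reset is arguably belt-and-suspenders, but it makes the reuse contract of finish() explicit.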

@seanses (Collaborator) left a comment:

This leads to some dedup loss against existing Xet data. But since we expect more data in the future and want to conform to the OSS spec, this is fine.

From experiments against existing models (measured with the xtool dedup command), for some model files we will see 13% loss:

[screenshot: dedup loss]

@hoytak merged commit fe71e3d into main on Sep 10, 2025
6 checks passed
@hoytak deleted the hoytak/250909-min-chunk-correctness branch on September 10, 2025 23:52
coyotte508 pushed a commit to huggingface/huggingface.js that referenced this pull request Sep 11, 2025
updates new wasm code to include xet wasm chunker fix from
huggingface/xet-core#487

wasm bin size 99K
