Updated chunker to eliminate spurious boundary triggering. #487
Conversation
```diff
  // If we have a lot of data, don't read all the way to the end when we'll stop reading
  // at the maximum chunk boundary.
- let read_end = n_bytes.min(consume_len + self.maximum_chunk - cur_chunk_len);
+ let read_end = n_bytes.min(cur_index + self.maximum_chunk - previous_len);
```
The number `cur_index + self.maximum_chunk - previous_len` doesn't seem to be identical to the previous value `consume_len + self.maximum_chunk - cur_chunk_len`. This result seems to go past the max chunk size boundary.
But I guess this doesn't impact the logic below that cuts at the max chunk size.
It should be the same -- those variables were mainly renamed.
See the `if` statement above: previously we had both `consume_len += max_advance` and `cur_chunk_len += max_advance`, but now there is only `cur_index += skip`.
Ah, true... I think that was actually in error too, which would cause even more spurious results: when there is already data in the buffer we start chunking earlier, allowing even smaller chunks.
seanses
left a comment
Optionally, also explicitly reset the hash when `is_final` is true, and document on the `finish()` function that the chunker can be reused.
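A minimal sketch of the suggestion above (hypothetical struct and fields, not the actual xet-core API): `finish()` emits the trailing partial chunk and explicitly resets all rolling state, so one chunker instance can safely be reused for a second stream.

```rust
#[derive(Default)]
struct Chunker {
    hash: u64,
    cur_chunk_len: usize,
}

impl Chunker {
    /// Finalize the current stream. Returns the length of the trailing
    /// partial chunk, if any, and resets the chunker so it can be reused.
    fn finish(&mut self) -> Option<usize> {
        let tail = (self.cur_chunk_len > 0).then_some(self.cur_chunk_len);
        // Explicit reset: without this, the rolling-hash state left over
        // from stream A could trigger a spurious early boundary in stream B.
        self.hash = 0;
        self.cur_chunk_len = 0;
        tail
    }
}

fn main() {
    let mut c = Chunker { hash: 0xdead, cur_chunk_len: 10 };
    assert_eq!(c.finish(), Some(10));
    // After finish() the chunker is back to a clean state.
    assert_eq!(c.hash, 0);
    assert_eq!(c.finish(), None);
    println!("ok");
}
```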
Updates the wasm code to include the xet wasm chunker fix from huggingface/xet-core#487. wasm bin size: 99K

We must enforce that the next boundary is actually past the minimum chunk size. In rare cases, a boundary could be triggered before the minimum chunk size has passed, and this triggering would be based not on the content of the file but on the previous state of the rolling hash function. This PR fixes this case.
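The invariant described above can be sketched with a toy rolling-hash chunker (hypothetical names and a toy hash update, not the xet-core implementation, which uses GearHash): the boundary predicate is only honored once the current chunk has reached the minimum size, and the hash is reset at each cut so stale state cannot trigger the next boundary.

```rust
struct Chunker {
    minimum_chunk: usize,
    maximum_chunk: usize,
    mask: u64,
    hash: u64,
    cur_chunk_len: usize,
}

impl Chunker {
    fn new(minimum_chunk: usize, maximum_chunk: usize, mask: u64) -> Self {
        Self { minimum_chunk, maximum_chunk, mask, hash: 0, cur_chunk_len: 0 }
    }

    // Feed bytes; return byte offsets (within `data`) where chunks end.
    fn next_boundaries(&mut self, data: &[u8]) -> Vec<usize> {
        let mut boundaries = Vec::new();
        for (i, &b) in data.iter().enumerate() {
            // Toy rolling update, standing in for the real rolling hash.
            self.hash = self.hash.wrapping_shl(1).wrapping_add(b as u64);
            self.cur_chunk_len += 1;

            // Key invariant from this PR: never emit a boundary before the
            // minimum chunk size, even if the hash predicate fires.
            let hash_hit = self.hash & self.mask == 0;
            if (hash_hit && self.cur_chunk_len >= self.minimum_chunk)
                || self.cur_chunk_len >= self.maximum_chunk
            {
                boundaries.push(i + 1);
                self.cur_chunk_len = 0;
                self.hash = 0; // reset so state doesn't leak into the next chunk
            }
        }
        boundaries
    }
}

fn main() {
    let mut c = Chunker::new(64, 256, 0x3f);
    let data = vec![0u8; 1024]; // all-zero data: the hash predicate fires on every byte
    let cuts = c.next_boundaries(&data);
    // Even though the predicate fires everywhere, every chunk is clamped
    // to exactly the minimum size: boundaries at 64, 128, 192, ...
    assert!(cuts.iter().zip(cuts.iter().skip(1)).all(|(a, b)| b - a == 64));
    println!("{} chunks, first at {}", cuts.len(), cuts[0]);
}
```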