Handle case when file size is too small for sample size#16
Conversation
This fixes the case where hashing may crash when using custom parameters if the sample size is too large for the file. - Update the spec and code to correctly constraint mode based on file size - Update tests - Minor lint fixes
|
I might just be missing something or not understand the algorithm correctly, but I'm not sure I quite understand where |
|
Oh wait, I think I realize my mistake. I confused myself by thinking the "middle" read would start directly after the "start" read in the case of |
|
@Hakkin You raise a good point, and funnily enough I noticed Since this PR effectively defines new behavior anyway, I'm coming around to the idea of |
|
As alternative, one could require that |
|
I want to maintain stability as much as possible for the well-defined part and keep the selection pattern the same. The change now is defining previously undefined behavior, so there is a little more latitude. BTW, this topic will also be revisited when I implement a user-defined number of chunks in the future. For now, I like the cutoff being: I've updated this PR with the change. And will get this released as v1.1.0. Thank you @hiql for the prompt update. Would you mind changing the cutoff and test vectors one more time? 🙏 |
Issue
Behavior can be unpredictable if the file size is less than ~2x the sample size and sampling is used. This could include hash values that differ between imohash implementations, or even crashing (see #15).
Affected Users
This affects users who:
sample_size>2*sample_thresholdsample_threshold-->2*sample_sizeAnalysis
imohash samples at the beginning, middle, and end of the file. If the sample size is
s, the file needs to be at least2s-1to sample the middle without hitting EOF. Furthermore, ifsis larger than the whole file, then there can be a seek() error when trying to samplesback from the end of the file.The original spec didn't correctly address this. There is a check described relating the sample size and sample threshold, but it doesn't really make sense. What matters is the actual file size and the sample size. Furthermore, the check was never implemented anyway!
Since the default sample size is 0.125 of the default threshold, the problem condition is never hit when using defaults.
Fix
The spec and code have been updated to check for sample size relative to file size and choose the correct mode.
If you're using custom parameters in the range described above and saving the hash, the hash post-upgrade will differ for files within the affected size range. In practice, this is a rare case. Most uses are for synchronization, and the hashes are ephemeral. Technically, this is a breaking change. However, given the narrow nature of this issue, I'm merely going to v1.1.0 with it and not a full v2.
Fixes #15