correct cp_similarity ratio ceiling #704
Merged
Ousret merged 1 commit into jawah:master on Mar 6, 2026
The byte decoding loop only iterated 255 times (`range(255)`), missing the final `0xFF` byte, and the match count was divided by 254. As a result, the returned ratio could exceed 1.0 (255/254) when comparing identical code pages. Updated the loop to `range(256)` and the divisor to 256 so that the maximum similarity ratio is exactly 1.0.
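The off-by-one described above can be reproduced with a small standalone sketch. This is not the library's implementation: the function name `similarity_sketch` and its `stop` and `divisor` parameters are hypothetical, introduced only so the old and new behaviour can be compared side by side. It assumes both encodings are single-byte codecs known to Python's standard `codecs` module.

```python
import codecs


def similarity_sketch(name_a: str, name_b: str, stop: int = 256, divisor: int = 256) -> float:
    """Illustrative only: ratio of single-byte values that decode identically.

    `stop` and `divisor` are hypothetical knobs used to mimic the old
    behaviour (stop=255, divisor=254) and the fixed one (256, 256).
    """
    # errors="ignore" makes undefined bytes decode to an empty string
    decoder_a = codecs.getincrementaldecoder(name_a)(errors="ignore")
    decoder_b = codecs.getincrementaldecoder(name_b)(errors="ignore")

    match_count = 0
    for value in range(stop):
        raw = bytes([value])
        if decoder_a.decode(raw) == decoder_b.decode(raw):
            match_count += 1

    return match_count / divisor


# Old behaviour: 255 matches divided by 254, slightly above 1.0
old_ratio = similarity_sketch("cp1252", "cp1252", stop=255, divisor=254)

# Fixed behaviour: 256 matches divided by 256
new_ratio = similarity_sketch("cp1252", "cp1252")
```

With identical code pages every byte matches, so the ratio is simply the loop length divided by the divisor. Keeping both at 256 is what bounds the result at 1.0.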
Ousret (Member) approved these changes on Mar 6, 2026 and left a comment:
lgtm, doesn't really impact anything; this function is only a helper to improve the lib.
`utils.cp_similarity` calculates the similarity between two code pages by decoding all possible single-byte sequences. However, the loop used `range(255)` (which skips the final byte, `0xFF`) and divided the result by `254`. If two encodings were exactly the same, this resulted in a similarity ratio of `~1.0039` instead of a maximum of `1.0`. The loop now uses `range(256)` and the return calculation is `character_match_count / 256`.

Checklist
- `CONTRIBUTING.md` document.
- `nox` (`nox -s test`, `nox -s lint`, and `nox -s coverage`).