correct cp_similarity ratio ceiling #704

Merged
Ousret merged 1 commit into jawah:master from ArmaanjeetSandhu:fix/bytearray-input-and-similarity-ratio
Mar 6, 2026

@ArmaanjeetSandhu
Contributor

  • The Issue: utils.cp_similarity calculates the similarity between two code pages by decoding all possible single-byte sequences. However, the loop used range(255) (which skips the final byte, 0xFF) and divided the result by 254. If two encodings were exactly the same, this resulted in a similarity ratio of ~1.0039 instead of a maximum of 1.0.
  • The Fix: Updated the iteration to range(256) and the return calculation to character_match_count / 256.
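The off-by-one described above can be reproduced with a minimal sketch. This is not the library's actual code: the helper names (`_count_matches`, `similarity_before`, `similarity_after`) are hypothetical, and the sketch assumes both arguments are single-byte codecs known to Python's `codecs` module, decoded with `errors="ignore"`.

```python
import codecs


def _count_matches(name_a: str, name_b: str, byte_count: int) -> int:
    """Count how many of the first `byte_count` byte values decode identically."""
    # Assumption: both names refer to single-byte codecs, so each byte
    # decodes independently of the ones before it.
    dec_a = codecs.getincrementaldecoder(name_a)(errors="ignore")
    dec_b = codecs.getincrementaldecoder(name_b)(errors="ignore")
    matches = 0
    for i in range(byte_count):
        raw = bytes([i])
        if dec_a.decode(raw) == dec_b.decode(raw):
            matches += 1
    return matches


def similarity_before(name_a: str, name_b: str) -> float:
    # Pre-fix arithmetic as described in the PR: range(255) tests bytes
    # 0x00..0xFE (255 values, 0xFF is skipped) but divides by 254.
    return _count_matches(name_a, name_b, 255) / 254


def similarity_after(name_a: str, name_b: str) -> float:
    # Post-fix arithmetic: all 256 byte values are tested and the count
    # is divided by 256, so identical code pages give exactly 1.0.
    return _count_matches(name_a, name_b, 256) / 256


if __name__ == "__main__":
    print(similarity_before("cp1252", "cp1252"))  # exceeds 1.0 (255/254)
    print(similarity_after("cp1252", "cp1252"))   # exactly 1.0
```

Comparing a code page with itself is the simplest check: every tested byte matches, so the pre-fix version returns 255/254 while the post-fix version is capped at 1.0.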

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have ensured that these changes do not break backward compatibility.
  • I have run the mandatory local checks successfully using nox (nox -s test, nox -s lint, and nox -s coverage).

The byte decoding loop only iterated 255 times
(`range(255)`), missing the final `0xFF` byte and
causing the returned match ratio to exceed 1.0
(e.g., 255/254) when comparing identical code
pages. Updated the loop to `range(256)` and the
divisor to 256 so that the maximum similarity
ratio is exactly 1.0.
Member

@Ousret Ousret left a comment

LGTM. This doesn't really impact anything; this function is only a helper used to improve the lib.

@Ousret Ousret merged commit e1e2ccb into jawah:master Mar 6, 2026
1 check passed
@ArmaanjeetSandhu ArmaanjeetSandhu deleted the fix/bytearray-input-and-similarity-ratio branch March 6, 2026 05:30