Test with chardet 6.0.0.post1 #1017
The previous French sample used only accented letters in bytes 0xA0-0xFF, where ISO-8859-1 and WINDOWS-1252 are byte-identical, so chardet was free to report either encoding. Adding curly quotes and an em dash (bytes 0x80-0x9F, control characters in ISO-8859-1) makes the byte string unambiguously WINDOWS-1252 across chardet versions, so the expectation no longer needs to change on upgrade.
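The byte-range argument above can be checked with the standard library alone. The following is a minimal sketch; the sample strings are hypothetical stand-ins, not the texts used in the PR's tests, and chardet itself is not invoked.

```python
import unicodedata

# Hypothetical sample: accented letters only (not the PR's actual test text).
accented = "Élève français à Noël"
latin1_bytes = accented.encode("iso-8859-1")
cp1252_bytes = accented.encode("cp1252")

# Every non-ASCII byte falls in 0xA0-0xFF, where both encodings agree,
# so the two byte strings are identical and a detector cannot tell them apart.
non_ascii = [b for b in cp1252_bytes if b >= 0x80]
same_bytes = latin1_bytes == cp1252_bytes
all_in_shared_range = all(0xA0 <= b <= 0xFF for b in non_ascii)

# Hypothetical sample with curly quotes (U+201C, U+201D) and an em dash (U+2014).
punctuated = "\u201c\u00c9l\u00e8ve\u201d \u2014 fran\u00e7ais"
cp1252_punct = punctuated.encode("cp1252")

# These characters land in 0x80-0x9F under WINDOWS-1252.
c1_range_bytes = sorted({b for b in cp1252_punct if 0x80 <= b <= 0x9F})

# ISO-8859-1 has no code points for them, so encoding fails outright.
try:
    punctuated.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Read back as ISO-8859-1, the same bytes decode to C1 control characters
# (Unicode category "Cc"), which is implausible for real text.
decoded_as_latin1 = cp1252_punct.decode("iso-8859-1")
control_chars = [c for c in decoded_as_latin1 if unicodedata.category(c) == "Cc"]
```

Because bytes in 0x80-0x9F only make sense as printable characters under WINDOWS-1252, a sample that contains them leaves a detector little room to report ISO-8859-1 instead.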
Kludex
left a comment
I think it's more understandable if it's unambiguous, so I modified the input a bit.
Thanks!
I think that this is an excellent improvement. Thanks! The new texts also detect as Windows-1252 with |
😢 Happy to review your PRs at any time! 🙏
Summary
Run the tests with chardet version 6.0.0.post1 instead of 5.2.0. Adjust expectations in three tests, in which the encoded text is short enough that the encoding may be detected ambiguously, and chardet may validly report a different encoding than the one that was specified when encoding the text.
This is an alternative approach to #1016. Based on feedback, it drops support for testing with chardet 5.x and focuses solely on 6.x.
This also tightens the assertion that was loosened in encode/httpx#3773; in #1016 (comment), @Kludex wrote that “I don't think we should have "in" in assertions.” All three tests now use identical assertions and are documented with identical comments.
Checklist
Additional information
Further adjustments would be needed to use the 7.x branch of chardet, which was rewritten with an LLM and controversially relicensed. See #1016 (comment) for more details.