Skip to content

Test with chardet 6.0.0.post1#1017

Merged
Kludex merged 2 commits into
pydantic:mainfrom
musicinmybrain:chardet6-again
Jun 2, 2026
Merged

Test with chardet 6.0.0.post1#1017
Kludex merged 2 commits into
pydantic:mainfrom
musicinmybrain:chardet6-again

Conversation

@musicinmybrain

Copy link
Copy Markdown
Contributor

Summary

Run the tests with chardet version 6.0.0.post1 instead of 5.2.0. Adjust expectations in three tests, in which the encoded text is short enough that the encoding may be detected ambiguously, and chardet may validly report a different encoding than the one that was specified when encoding the text.

This is an alternative approach to #1016. Based on feedback, it drops support for testing with chardet 5.x and focuses solely on 6.x.

This also tightens the assertion that was loosened in encode/httpx#3773; in #1016 (comment), @Kludex wrote that “I don't think we should have "in" in assertions.” All three tests now use identical assertions and are documented with identical comments.

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change. N/A; this is a testing change.
  • I've updated the documentation accordingly. Nothing to update

Additional information

Further adjustments would be needed to use the 7.x branch of chardet, which was rewritten with an LLM and controversially relicensed. See #1016 (comment) for more details.

@codspeed-hq

codspeed-hq Bot commented Jun 2, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 15 untouched benchmarks
⏩ 7 skipped benchmarks1


Comparing musicinmybrain:chardet6-again (2591068) with main (2b0daf5)2

Open in CodSpeed

Footnotes

  1. 7 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (30b3837) during the generation of this report, so 2b0daf5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

Re-trigger cubic

The previous French sample used only accented letters in bytes 0xA0-0xFF,
where ISO-8859-1 and WINDOWS-1252 are byte-identical, so chardet was free to
report either encoding. Adding curly quotes and an em dash (bytes 0x80-0x9F,
control characters in ISO-8859-1) makes the byte string unambiguously
WINDOWS-1252 across chardet versions, so the expectation no longer needs to
change on upgrade.

@Kludex Kludex left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more understandable if it's unambiguous - so I modify the input a bit.

Thanks!

@Kludex Kludex merged commit 26dd31e into pydantic:main Jun 2, 2026
12 checks passed
@musicinmybrain

Copy link
Copy Markdown
Contributor Author

I think it's more understandable if it's unambiguous - so I modify the input a bit.

I think that this is an excellent improvement. Thanks!

The new texts also detect as Windows-1252 with chardet 7.x, which is nice. Unfortunately, chardet 7.x uses the string 'Windows-1252', vs. chardet 6.x’s 'WINDOWS-1252', so the expectations would still have to be adjusted again, but at least the discrepancy is smaller.

@Kludex

Kludex commented Jun 2, 2026

Copy link
Copy Markdown
Member

chardet 7.x uses the string 'Windows-1252', vs. chardet 6.x’s 'WINDOWS-1252'

😢


Happy to review yours PRs at any time! 🙏

@Kludex Kludex mentioned this pull request Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants