@john-hen
Contributor

@john-hen john-hen commented Jun 8, 2025

Changed the encoding from `utf8` to `utf-8-sig` when reading files, in order to ignore a possible byte-order mark (a.k.a. BOM, code point U+FEFF) at the start of the file.

As per the Python documentation:

> In some areas, it is also convention to use a “BOM” at the start of
> UTF-8 encoded files; the name is misleading since UTF-8 is not
> byte-order dependent. The mark simply announces that the file is
> encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec
> to automatically skip the mark if present.

https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data

So this change won't affect reading UTF8-encoded files without a BOM.
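
To illustrate the claim, here is a minimal sketch using only the standard library. The sample text and variable names are made up for the illustration and are not taken from the code base. It decodes the same content with and without a BOM using both codecs.

```python
import codecs

# Hypothetical sample content; any valid UTF-8 text would do.
text = "x = 1\n"
plain_bytes = text.encode("utf-8")
bom_bytes = codecs.BOM_UTF8 + plain_bytes  # prefixed with b"\xef\xbb\xbf"

# The plain codec keeps the BOM as a leading U+FEFF character.
decoded_raw = bom_bytes.decode("utf-8")

# The -sig codec strips a leading BOM when one is present...
decoded_sig = bom_bytes.decode("utf-8-sig")

# ...and leaves input without a BOM untouched.
decoded_plain = plain_bytes.decode("utf-8-sig")

assert decoded_raw == "\ufeff" + text
assert decoded_sig == text
assert decoded_plain == text
```

The same codec name can be passed as the `encoding` argument to `open()` or `Path.read_text()` when reading from disk.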

Fixes #386.

@john-hen
Contributor Author

john-hen commented Jun 8, 2025

I don't think this PR has anything to do with the reported Mypy errors, unless I'm missing something. (I only ran pytest before submitting.)

@pawamoy
Member

pawamoy commented Jun 8, 2025

Thanks! You can rebase on main to get rid of the mypy warnings 👍

Changed the encoding from `utf8` to `utf-8-sig` throughout the code base
when reading files, in order to ignore a possible byte-order mark
(a.k.a. BOM, code point U+FEFF) at the start of the file.

As per the Python documentation:
> In some areas, it is also convention to use a “BOM” at the start of
> UTF-8 encoded files; the name is misleading since UTF-8 is not
> byte-order dependent. The mark simply announces that the file is
> encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec
> to automatically skip the mark if present.

https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data

So this change won't affect reading UTF8-encoded files without a BOM.
@john-hen john-hen force-pushed the support-utf8-bom branch from 5b3816d to 603088f on June 8, 2025 17:04
@pawamoy
Member

pawamoy commented Jun 8, 2025

Oh, can you please add a test that runs on Windows only, asserting the fix works? It should check that trying to load a BOM'd module with UTF8 raises a LoadingError, while it works with UTF8-SIG.
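
A rough sketch of the shape such a check could take, using only the standard library. It does not call the project's loader or its `LoadingError`, and the helper names and the file name are hypothetical. It writes a module with a BOM to a temporary directory, reads it back with both codecs, and confirms that the `utf-8-sig` text parses.

```python
import ast
import codecs
import tempfile
from pathlib import Path


def read_module(path: Path, encoding: str) -> str:
    """Read a module's source text with the given codec."""
    return path.read_text(encoding=encoding)


def read_bom_module_both_ways() -> tuple[str, str]:
    """Write a BOM-prefixed module to disk and read it back with both codecs."""
    with tempfile.TemporaryDirectory() as tmp:
        module = Path(tmp) / "bom_module.py"  # hypothetical file name
        module.write_bytes(codecs.BOM_UTF8 + b"x = 1\n")
        return read_module(module, "utf-8"), read_module(module, "utf-8-sig")


raw_text, clean_text = read_bom_module_both_ways()

# With plain UTF-8 the BOM survives as the first character of the source.
assert raw_text.startswith("\ufeff")

# With utf-8-sig the BOM is gone and the source parses as ordinary Python.
assert clean_text == "x = 1\n"
ast.parse(clean_text)
```

In an actual test suite, the Windows-only restriction could be expressed with a marker such as `pytest.mark.skipif(sys.platform != "win32", ...)`, and the assertions would go through the project's own loading function rather than `ast.parse`.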


@pawamoy pawamoy left a comment

LGTM, thanks!

@pawamoy pawamoy merged commit b346190 into mkdocstrings:main Jul 21, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

bug: Failure to parse files with UTF8 byte-order mark