Always use `encoding="utf-8-sig"` when reading text files #387

john-hen · 2025-06-08T15:08:49Z

Changed the encoding from `utf8` to `utf-8-sig` throughout the code base when reading files, in order to ignore a possible byte-order mark (a.k.a. BOM, code point U+FEFF) at the start of the file.

As per the Python documentation:

> In some areas, it is also convention to use a “BOM” at the start of
> UTF-8 encoded files; the name is misleading since UTF-8 is not
> byte-order dependent. The mark simply announces that the file is
> encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec
> to automatically skip the mark if present.

https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data

So this change won't affect reading UTF-8-encoded files that have no BOM.
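For illustration, here is a minimal standard-library sketch of the codec behavior described above. It is not code from this PR; the file name `module.py` and the helper `read_both` are hypothetical and exist only for the example.

```python
import codecs
import tempfile
from pathlib import Path

SOURCE = "x = 1\n"


def read_both(data: bytes) -> tuple[str, str]:
    """Write `data` to a temporary file and read it back with both codecs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "module.py"  # hypothetical file name
        path.write_bytes(data)
        return (
            path.read_text(encoding="utf-8"),
            path.read_text(encoding="utf-8-sig"),
        )


# File that starts with the UTF-8 BOM (bytes EF BB BF).
bom_plain, bom_sig = read_both(codecs.BOM_UTF8 + SOURCE.encode("utf-8"))

# File without a BOM.
nobom_plain, nobom_sig = read_both(SOURCE.encode("utf-8"))

print(repr(bom_plain))  # plain utf-8 keeps U+FEFF as the first character
print(repr(bom_sig))    # utf-8-sig drops the mark
print(nobom_plain == nobom_sig)  # both codecs agree when there is no BOM
```

With plain `utf-8`, the decoded text keeps U+FEFF as its first character; with `utf-8-sig` the mark is skipped, and files without a BOM decode identically under both codecs.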

Fixes #386.

john-hen · 2025-06-08T15:34:05Z

I don't think this PR has anything to do with the reported Mypy errors, unless I'm missing something. (I only ran pytest before submitting.)

pawamoy · 2025-06-08T16:58:44Z

Thanks! You can rebase on main to get rid of the mypy warnings 👍

pawamoy · 2025-06-08T17:10:16Z

Oh, can you please add a test that runs on Windows only, asserting the fix works? It should check that trying to load a BOM'd module with UTF8 raises a LoadingError, while it works with UTF8-SIG.
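A rough, standard-library-only sketch of the kind of check requested here, under loud assumptions: `ast.parse` stands in for the project's loader and `SyntaxError` stands in for `LoadingError`, so this is not the test that was added to the repository. The helper `parses` is hypothetical, and the sketch is not restricted to Windows.

```python
import ast
import codecs

# Source bytes as an editor that writes a UTF-8 BOM would save them.
bom_bytes = codecs.BOM_UTF8 + b"x = 1\n"


def parses(source: str) -> bool:
    """Return True if `source` is accepted by the Python parser."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Decoding with plain utf-8 leaves U+FEFF in the text, which the parser rejects.
parses_with_utf8 = parses(bom_bytes.decode("utf-8"))

# Decoding with utf-8-sig strips the mark, so the module parses normally.
parses_with_sig = parses(bom_bytes.decode("utf-8-sig"))

print(parses_with_utf8, parses_with_sig)
```

In a pytest suite, a platform guard such as `pytest.mark.skipif(sys.platform != "win32", reason="...")` could restrict a test like this to Windows.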

pawamoy

LGTM, thanks!

john-hen force-pushed the support-utf8-bom branch from 5b3816d to 603088f on June 8, 2025 17:04

fixup! Always use encoding="utf-8-sig" when reading text files

cbe63be

pawamoy approved these changes Jul 21, 2025

Merge branch 'main' into support-utf8-bom

c908ff5

pawamoy merged commit b346190 into mkdocstrings:main Jul 21, 2025
2 checks passed
