UTF7: enable detection of empty-document with byte-order-mark by jayaddison · Pull Request #717 · jawah/charset_normalizer

jayaddison · 2026-03-16T23:46:27Z

In UTF7 encoding, a standalone byte-order-mark (U+FEFF) encodes to the hex sequence 2b 2f 76 38 2d.
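The byte sequence can be checked at the Python command line. This is a minimal sketch using only the built-in `utf-7` codec; the variable name `bom_utf7` is chosen here for illustration.

```python
# Encode a lone U+FEFF with Python's built-in UTF-7 codec.
bom_utf7 = "\ufeff".encode("utf-7")

print(bom_utf7)           # b'+/v8-'
print(bom_utf7.hex(" "))  # 2b 2f 76 38 2d
```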

This library defines a UTF7 constant to detect that condition -- but it was eclipsed by a preceding detection constant of 2b 2f 76 38 (a prefix containing four-of-the-five bytes).

Relocating the constant to the start of the UTF7 BOM constant list allows empty-content to be returned by the best-match detection.
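The ordering effect can be illustrated with a small sketch. The two lists and the `first_match` helper below are hypothetical stand-ins, not the library's actual constants or functions; the only assumption is that candidate marks are tested in order with a prefix check and the first hit wins.

```python
from typing import Optional, Sequence

# Hypothetical signature lists (illustrative only, not the library's constants).
SHORT_FIRST = [b"\x2b\x2f\x76\x38", b"\x2b\x2f\x76\x38\x2d"]
LONG_FIRST = [b"\x2b\x2f\x76\x38\x2d", b"\x2b\x2f\x76\x38"]


def first_match(payload: bytes, marks: Sequence[bytes]) -> Optional[bytes]:
    """Return the first mark that is a prefix of payload, or None."""
    for mark in marks:
        if payload.startswith(mark):
            return mark
    return None


payload = "\ufeff".encode("utf-7")  # b'+/v8-'

# With the 4-byte prefix listed first, the trailing '-' is left over.
matched_short = first_match(payload, SHORT_FIRST)
print(payload[len(matched_short):])  # b'-'

# With the 5-byte mark listed first, the whole payload is consumed.
matched_long = first_match(payload, LONG_FIRST)
print(payload[len(matched_long):])  # b''
```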

>>> from charset_normalizer.api import from_bytes
>>> import unicodedata
>>> char = unicodedata.lookup('BOM')
>>> det = from_bytes(char.encode('utf-7'))
>>> str(det.best())
''

Resolves #716.

One of the UTF7 byte order mark (BOM) variants was redundant, because it contained one of the other, shorter byte order mark variants as a prefix. This means that it would never provide detection of UTF7 at runtime, because the other prefix would always take priority. Remove it to remove a comparison when inspecting UTF7 BOM content.

This reverts commit f525200.

Checking for this special-case UTF7 BOM before the other cases allows us to detect an empty document, which Python 3 encodes into ASCII with a trailing minus symbol ('-').

jayaddison

I'm not sure that this is correct; the case I'm currently wondering about is when a bytestream starts with the empty-sequence pattern -- but then also continues with other, valid bytestream content.

In that scenario: using an equality check instead of .startswith would be more appropriate here.
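To make the question concrete, here is a minimal sketch of how the two checks differ. `EMPTY_DOC_MARK` is a name chosen for illustration, not an identifier from the library.

```python
EMPTY_DOC_MARK = b"+/v8-"  # UTF-7 encoding of a lone U+FEFF

empty_doc = b"+/v8-"
doc_with_content = b"+/v8-hello"

# A prefix check accepts both payloads.
print(empty_doc.startswith(EMPTY_DOC_MARK))         # True
print(doc_with_content.startswith(EMPTY_DOC_MARK))  # True

# An equality check accepts only the BOM-only payload.
print(empty_doc == EMPTY_DOC_MARK)         # True
print(doc_with_content == EMPTY_DOC_MARK)  # False

# Python's UTF-7 codec decodes the longer payload to a BOM followed by text.
decoded = doc_with_content.decode("utf-7")
print(repr(decoded))  # '\ufeffhello'
```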

jayaddison · 2026-03-20T16:00:45Z

Note / disclaimer: I recently browsed some of the chardet v5.2.0...v6.0.0 code diff, as part of determining whether to upgrade to version v7.0.0 of that library; those versions of chardet before v7.0.0 are LGPL-licensed. However: I am also willing to state that this branch/pull-request is entirely my own work; it was developed in combination with consulting public documentation on Wikipedia and the Python 3.14 documentation, and some experimentation at the Python command-line, to learn some of the practical details of the UTF7 encoding and byte-order-marks.

Ousret

lgtm.

the usage of startswith is correct, the order of sigs wasn't. the PR properly addresses the issue at hand.

regards,

jayaddison · 2026-03-22T14:44:40Z

Thank you @Ousret

jayaddison · 2026-03-22T14:48:44Z

@Ousret NB: I should not have included the Resolves ... statement in the pull request description here; because the changes as-merged relocate the pattern, rather than removing it entirely, I think that bug report #716 remains valid.

One of the UTF7 byte order mark (BOM) variants was redundant; it is an extension of one of the subsequent pattern variants. This does not affect correctness, but it does imply that string-matching work may be duplicated. Remove the longer, superset prefix to remove a comparison when inspecting some UTF7 BOM content.

Refs:
- Closes #716.
- Follow-up to PR #717.
- Relates-to commit b579cd6.

jayaddison added 5 commits March 16, 2026 23:17

Tests: add detection coverage for UTF7 with BOM

4745ffe

Revert ":zap: Remove redundant UTF7 BOM"

5a00e59

This reverts commit f525200.

Tests: enable (failing) UTF7+BOM empty-content case

8cebd9c

🐛 Allow UTF7 BOM to detect empty-content case

2e55467

Checking for this special-case UTF7 BOM before the other cases allows us to detect an empty document, which Python 3 encodes into ASCII with a trailing minus symbol ('-').

jayaddison requested a review from Ousret as a code owner March 16, 2026 23:46

jayaddison commented Mar 17, 2026


jayaddison marked this pull request as draft March 17, 2026 18:06

Ousret marked this pull request as ready for review March 22, 2026 14:16

Ousret approved these changes Mar 22, 2026


Ousret merged commit 386539d into jawah:master Mar 22, 2026
42 checks passed

jayaddison deleted the issue-716/remove-redundant-utf7-bom-substring branch March 22, 2026 14:44

jayaddison mentioned this pull request Apr 2, 2026

Remove redundant UTF7 BOM #730

Merged

Ousret merged 5 commits into jawah:master from openculinary:issue-716/remove-redundant-utf7-bom-substring
