
UTF7: enable detection of empty-document with byte-order-mark #717

Merged
Ousret merged 5 commits into jawah:master from openculinary:issue-716/remove-redundant-utf7-bom-substring on Mar 22, 2026

@jayaddison

Contributor

In UTF7 encoding, a standalone byte-order-mark (U+FEFF) encodes to the hex sequence 2b 2f 76 38 2d.

This library defines a UTF7 constant to detect that condition -- but it was eclipsed by a preceding detection constant of 2b 2f 76 38 (a prefix containing four-of-the-five bytes).

Relocating the constant to the start of the UTF7 BOM constant list allows empty-content to be returned by the best-match detection.

>>> from charset_normalizer.api import from_bytes
>>> import unicodedata
>>> char = unicodedata.lookup('BOM')
>>> det = from_bytes(char.encode('utf-7'))
>>> str(det.best())
''
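
The ordering problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: `first_match`, `SHORT`, and `FULL` are hypothetical names and do not reproduce the library's own constants or functions. It shows how a first-match prefix scan leaves a stray `-` byte when the four-byte pattern is tried before the five-byte one.

```python
# Hypothetical names for illustration; not the library's actual API.
SHORT = b"\x2b\x2f\x76\x38"       # b"+/v8", four-byte prefix
FULL = b"\x2b\x2f\x76\x38\x2d"    # b"+/v8-", standalone BOM in UTF-7


def first_match(data: bytes, marks: list[bytes]) -> bytes:
    """Return the first mark that `data` starts with, or b"" if none match."""
    for mark in marks:
        if data.startswith(mark):
            return mark
    return b""


# A document that contains only U+FEFF, encoded by Python's utf-7 codec.
encoded = "\ufeff".encode("utf-7")

# Shorter prefix listed first: it wins, and one byte is left over.
before = first_match(encoded, [SHORT, FULL])
remainder_before = encoded[len(before):]

# Longer pattern listed first: the whole input is consumed as the BOM.
after = first_match(encoded, [FULL, SHORT])
remainder_after = encoded[len(after):]

print(before, remainder_before)   # the payload still holds the "-" terminator
print(after, remainder_after)     # the payload is empty
```

With the longer pattern first, nothing remains after the mark is stripped, which is consistent with the empty string shown in the session above.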

Resolves #716.

One of the UTF7 byte order mark (BOM) variants was redundant,
because it contained one of the other, shorter byte order mark
variants as a prefix.

This means that it would never provide detection of UTF7 at
runtime, because the other prefix would always take priority.

Remove it to remove a comparison when inspecting UTF7 BOM
content.

Checking for this special-case UTF7 BOM before the other cases
allows us to detect an empty document, which Python 3 encodes
into ASCII with a trailing minus symbol ('-').
@jayaddison jayaddison requested a review from Ousret as a code owner March 16, 2026 23:46

@jayaddison jayaddison left a comment

Contributor Author


I'm not sure that this is correct; the case I'm currently wondering about is when a bytestream starts with the empty-sequence pattern -- but then also continues with other, valid bytestream content.

In that scenario: using an equality check instead of .startswith would be more appropriate here.
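
To make the scenario in this comment concrete, here is a small sketch comparing the two checks on a stream that begins with the five-byte pattern and then continues with more content. The helper names and the sample stream are assumptions made for illustration; they are not taken from the library.

```python
# Hypothetical helpers for illustration; not the library's actual API.
BOM_ONLY = b"+/v8-"  # UTF-7 encoding of a standalone U+FEFF


def matches_by_prefix(data: bytes) -> bool:
    """True when the data begins with the five-byte pattern."""
    return data.startswith(BOM_ONLY)


def matches_by_equality(data: bytes) -> bool:
    """True only when the data is exactly the five-byte pattern."""
    return data == BOM_ONLY


# A stream that starts with the pattern and then carries further content.
stream = b"+/v8-hello"

print(matches_by_prefix(stream))     # prefix check accepts it
print(matches_by_equality(stream))   # equality check rejects it

# Python's utf-7 codec decodes the stream to U+FEFF followed by the text,
# so the leading five bytes still act as a byte-order-mark here.
decoded = stream.decode("utf-7")
print(repr(decoded))
```

An equality check would recognise only the empty-document case, while a prefix check also recognises the mark when content follows it.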

@jayaddison jayaddison marked this pull request as draft March 17, 2026 18:06
@jayaddison

Contributor Author

Note / disclaimer: I recently browsed some of the chardet v5.2.0...v6.0.0 code diff, as part of determining whether to upgrade to version v7.0.0 of that library; those versions of chardet before v7.0.0 are LGPL-licensed. However: I am also willing to state that this branch/pull-request is entirely my own work; it was developed in combination with consulting public documentation on Wikipedia and the Python 3.14 documentation, and some experimentation at the Python command-line, to learn some of the practical details of the UTF7 encoding and byte-order-marks.

@Ousret Ousret marked this pull request as ready for review March 22, 2026 14:16

@Ousret Ousret left a comment

Member


lgtm.

the usage of startswith is correct; the order of the sigs wasn't. the PR properly addresses the issue at hand.

regards,

@Ousret Ousret merged commit 386539d into jawah:master Mar 22, 2026
42 checks passed
@jayaddison

Contributor Author

Thank you @Ousret

@jayaddison jayaddison deleted the issue-716/remove-redundant-utf7-bom-substring branch March 22, 2026 14:44
@jayaddison

Contributor Author

@Ousret NB: I should not have included the Resolves ... statement in the pull request description here; because the changes as-merged relocate the pattern, rather than removing it entirely, I think that bugreport #716 remains valid.

Ousret pushed a commit that referenced this pull request Apr 3, 2026
One of the UTF7 byte order mark (BOM) variants was redundant; it is an extension of one of the subsequent pattern variants.

This does not affect correctness, but it does imply that string-matching work may be duplicated.

Remove the longer, superset prefix to remove a comparison when inspecting some UTF7 BOM content.

Refs:
 - Closes #716.
 - Follow-up to PR #717.
 - Relates-to commit b579cd6.
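
The redundancy described in this commit message can be checked mechanically. The sketch below is a hypothetical helper, not part of the library: it lists every pair in which one signature extends another. The sample list is an assumption, combining the two patterns discussed in this pull request with the other commonly documented UTF-7 BOM forms.

```python
# Hypothetical helper for illustration; not the library's actual API.
def prefix_related_pairs(marks: list[bytes]) -> list[tuple[bytes, bytes]]:
    """Return (longer, shorter) pairs where `shorter` is a proper prefix of `longer`."""
    return [
        (longer, shorter)
        for longer in marks
        for shorter in marks
        if longer != shorter and longer.startswith(shorter)
    ]


# Assumed signature list: the five-byte pattern placed first, followed by
# the four-byte forms (the last three are assumptions, not from this thread).
marks = [b"+/v8-", b"+/v8", b"+/v9", b"+/v+", b"+/v/"]

pairs = prefix_related_pairs(marks)
print(pairs)  # the five-byte pattern is reported as an extension of b"+/v8"
```

Any input that starts with the longer pattern also starts with the shorter one, so a plain prefix scan reaches the same encoding verdict from either entry; the entries differ only in how many bytes are treated as the mark.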
Linked issue: Unnecessary substring in UTF-7 BOM list