UTF7: enable detection of empty-document with byte-order-mark#717
Conversation
One of the UTF7 byte order mark (BOM) variants was redundant, because it contained one of the other, shorter byte order mark variants as a prefix. This means that it would never provide detection of UTF7 at runtime, because the other prefix would always take priority. Remove it to remove a comparison when inspecting UTF7 BOM content.
This reverts commit f525200.
Checking for this special-case UTF7 BOM before the other cases
allows us to detect an empty document, which Python 3 encodes
into ASCII with a trailing minus symbol ('-').
jayaddison
left a comment
There was a problem hiding this comment.
I'm not sure that this is correct; the case I'm currently wondering about is when a bytestream starts with the empty-sequence pattern -- but then also continues with other, valid bytestream content.
In that scenario: using an equality check instead of .startswith would be more appropriate here.
|
Note / disclaimer: I recently browsed some of the |
|
Thank you @Ousret |
One of the UTF7 byte order mark (BOM) variants was redundant; it is an extension of one of the subsequent pattern variants. This does not affect correctness, but it does imply that string-matching work may be duplicated. Remove the longer, superset prefix to remove a comparison when inspecting some UTF7 BOM content. Refs: - Closes #716. - Follow-up to PR #717. - Relates-to commit b579cd6.
In UTF7 encoding, a standalone byte-order-mark (U+FEFF) encodes to the hex sequence
2b 2f 76 38 2d.This library defines a UTF7 constant to detect that condition -- but it was eclipsed by a preceding detection constant of
2b 2f 76 38(a prefix containing four-of-the-five bytes).Relocating the constant to the start of the UTF7 BOM constant list allows empty-content to be returned by the best-match detection.
Resolves #716.