[BUG] Unreliable decoding of UTF7 strings with byte-order-mark #718

@jayaddison

Describe the bug
As described in the README, the library provides an interface to retrieve normalized text from an input stream.

For some UTF7 strings that begin with a byte-order mark, the library often identifies the encoding correctly as UTF7, but the normalized/decoded text it returns is unreliable: the result depends on the character that immediately follows the byte-order mark.

To Reproduce
Inline Python bytestring examples below:

from charset_normalizer import from_bytes
str(from_bytes('\ufeff.testing'.encode('utf-7')).best())  # works as expected
str(from_bytes('\ufeff-testing'.encode('utf-7')).best())  # returns extraneous leading minus-symbol
str(from_bytes('\ufeff+testing'.encode('utf-7')).best())  # returns extraneous b64-ish prefix

Expected behavior
The above would ideally return:

'.testing'
'-testing'
'+testing'

Actual behavior
The results for v3.4.6 are:

'.testing'
'--testing'
'AKw-testing'
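
The pattern in these results can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not a confirmed diagnosis of the library's internals: it assumes the byte-order mark is treated as a fixed 4-byte prefix (`+/v8`) that is removed before decoding. The names `UTF7_BOM_PREFIX`, `strip_then_decode` and `decode_then_strip` are hypothetical and are not part of charset_normalizer.

```python
BOM = "\ufeff"

# Assumption: the UTF-7 byte-order mark is matched as a fixed byte signature.
# This constant is illustrative only, not taken from the library.
UTF7_BOM_PREFIX = b"+/v8"


def strip_then_decode(payload: bytes) -> str:
    """Naive approach: drop a fixed byte prefix, then decode what remains."""
    if payload.startswith(UTF7_BOM_PREFIX):
        payload = payload[len(UTF7_BOM_PREFIX):]
    return payload.decode("utf-7")


def decode_then_strip(payload: bytes) -> str:
    """Alternative: decode the whole payload, then drop a leading U+FEFF."""
    text = payload.decode("utf-7")
    return text[1:] if text.startswith(BOM) else text


for sample in (".testing", "-testing", "+testing"):
    encoded = (BOM + sample).encode("utf-7")
    # Show the raw bytes next to both decoding strategies.
    print(encoded, repr(strip_then_decode(encoded)), repr(decode_then_strip(encoded)))
```

With CPython's `utf-7` codec, the naive path yields `'.testing'`, `'--testing'` and `'AKw-testing'`, the same values reported above, while decoding first and stripping U+FEFF afterwards yields the expected text. In UTF-7 the byte-order mark opens a base64 shift sequence, so where that sequence ends depends on the next character: a following `-` needs an explicit terminator, and a following `+` is encoded inside the same base64 run as the mark.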

Desktop:

  • OS: Debian GNU/Linux (testing / forky)
  • Python version: 3.13.12
  • Package version: 3.4.6

Additional context
Encountered after exploring the issue described in #716.

Labels: bug, help wanted
