Describe the bug
As described in the README, the library provides an interface to retrieve normalized text from an input stream.
For some UTF-7 inputs that begin with a byte-order mark (BOM), the library often identifies the encoding correctly as UTF-7, but the normalized/decoded text it returns is wrong.
To Reproduce
Inline Python bytestring examples below:
from charset_normalizer import from_bytes
str(from_bytes('\ufeff.testing'.encode('utf-7')).best()) # works as expected
str(from_bytes('\ufeff-testing'.encode('utf-7')).best()) # returns extraneous leading minus-symbol
str(from_bytes('\ufeff+testing'.encode('utf-7')).best()) # returns extraneous b64-ish prefix
Expected behavior
The above would ideally return:
'.testing'
'-testing'
'+testing'
Actual behaviour
The results for v3.4.6 are:
'.testing'
'--testing'
'AKw-testing'
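The actual results look consistent with the UTF-7 BOM signature (`+/v8`) being cut off the raw bytes before the remainder is decoded. This is a hypothesis only and has not been checked against the library's source. The sketch below uses nothing but Python's built-in `utf-7` codec to compare the two approaches; `reference_decode`, `naive_strip_decode` and `UTF7_BOM_PREFIX` are hypothetical names introduced for illustration.

```python
# Hypothetical helpers for illustration; neither is part of charset_normalizer.
UTF7_BOM_PREFIX = b'+/v8'  # first four bytes of U+FEFF when encoded as UTF-7


def reference_decode(raw: bytes) -> str:
    """Decode the whole payload first, then drop the leading BOM character."""
    text = raw.decode('utf-7')
    return text[1:] if text.startswith('\ufeff') else text


def naive_strip_decode(raw: bytes) -> str:
    """Assumed failure mode: remove the BOM bytes, then decode what is left."""
    if raw.startswith(UTF7_BOM_PREFIX):
        raw = raw[len(UTF7_BOM_PREFIX):]
    return raw.decode('utf-7')


for sample in ('\ufeff.testing', '\ufeff-testing', '\ufeff+testing'):
    raw = sample.encode('utf-7')
    # Print the encoded bytes next to both decoding strategies.
    print(raw, repr(reference_decode(raw)), repr(naive_strip_decode(raw)))
```

If the hypothesis holds, the underlying problem is that in UTF-7 the BOM shares its base64 run with whatever follows it, so the run's `-` terminator or trailing base64 characters are left behind when only the first four bytes are removed.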
Desktop (please complete the following information):
- OS: Debian GNU/Linux (testing / forky)
- Python version: 3.13.12
- Package version: 3.4.6
Additional context
Encountered after exploring the issue described in #716.