Fix reader for Unicode code points over 0xFFFF#351
Fix reader for Unicode code points over 0xFFFF#351perlpunk merged 1 commit intoyaml:release/5.3from anishathalye:fix-unicode-non-bmp-ucs-2
Conversation
This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 that does not have UCS-4 support
(which certain distributions still ship, e.g. macOS).
When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:
[i for i in u'\U0001f3d4']
In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.
It would be fairly natural to write a regular expression that captures
strings that contain only *printable* characters, as opposed to
*non-printable* characters (as identified by the old code, so not
excluding surrogate pairs):
PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')
Adding support for surrogate pairs to this would be straightforward,
adding the option of having a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):
PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')
Then, this regex could be used as follows:
def check_printable(self, data):
if not self.PRINTABLE.match(data):
raise ReaderError(...)
However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).
Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by removing surrogate pairs from the existing
character set and adding the following options for illegal uses of
surrogate code points:
- Surrogate low that doesn't follow a surrogate high (either a surrogate
low at the start of a string, or a surrogate low that follows a
character that's not a surrogate high):
(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]
- Surrogate high that isn't followed by a surrogate low (either a
surrogate high at the end of a string, or a surrogate high that is
followed by a character that's not a surrogate low):
[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)
The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
|
Thanks! |
|
Oh yeah, I was going to do that and see what happened, but I forgot. Enabling those tests as part of this PR would make sense. The tests can be enabled as follows: diff --git i/tests/lib/test_appliance.py w/tests/lib/test_appliance.py
index 5ec4575..7822979 100644
--- i/tests/lib/test_appliance.py
+++ w/tests/lib/test_appliance.py
@@ -27,8 +27,6 @@ def find_test_filenames(directory):
base, ext = os.path.splitext(filename)
if base.endswith('-py3'):
continue
- if not has_ucs4 and base.find('-ucs4-') > -1:
- continue
filenames.setdefault(base, []).append(ext)
filenames = filenames.items()
filenames.sort()(it might make sense to rename those For comparison, here's what happens with the UCS-4 tests enabled for a non-UCS-4 Python 2 build, prior to this patch: Here's what happens with this patch: The single test that's still failing involves encoding/decoding |
|
We decided to get out a 5.2 as soon as possible and not add more PRs to it, but I hope we can do 5.3 soon after that and include this PR! So far I haven't found out what the |
|
Sounds good! Hopefully this can be included in 5.2.1, as it's more of a bugfix than a new feature? Looks like that file was originally introduced in this commit: https://bitbucket.org/xi/pyyaml/commits/6c7c12bb8fe64828e400a5a6c0fb5dd9911392c3 It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/ |
Thanks for going through the commits, I hadn't saved the commit id after I searched for this :) Actually it references this one, as bitbucket was introduced later: |
|
Ah that makes sense! The referenced issue on Bitbucket didn't really seem to match the commit |
|
Thanks! merged to release/5.3 |
After #351 the tests are working again
This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 that does not have UCS-4 support
(which certain distributions still ship, e.g. macOS).
When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:
[i for i in u'\U0001f3d4']
In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.
It would be fairly natural to write a regular expression that captures
strings that contain only *printable* characters, as opposed to
*non-printable* characters (as identified by the old code, so not
excluding surrogate pairs):
PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')
Adding support for surrogate pairs to this would be straightforward,
adding the option of having a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):
PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')
Then, this regex could be used as follows:
def check_printable(self, data):
if not self.PRINTABLE.match(data):
raise ReaderError(...)
However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).
Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by removing surrogate pairs from the existing
character set and adding the following options for illegal uses of
surrogate code points:
- Surrogate low that doesn't follow a surrogate high (either a surrogate
low at the start of a string, or a surrogate low that follows a
character that's not a surrogate high):
(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]
- Surrogate high that isn't followed by a surrogate low (either a
surrogate high at the end of a string, or a surrogate high that is
followed by a character that's not a surrogate low):
[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)
The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
|
released https://pypi.org/project/PyYAML/5.3/ |
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7 5.3.1 (2020-03-18) * yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor 5.3 (2020-01-06) * yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None` * yaml/pyyaml#270 -- fix typos and stylistic nit * yaml/pyyaml#309 -- Fix up small typo * yaml/pyyaml#161 -- Fix handling of __slots__ * yaml/pyyaml#358 -- Allow calling add_multi_constructor with None * yaml/pyyaml#285 -- Add use of safe_load() function in README * yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF * yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff * yaml/pyyaml#359 -- Use full_load in yaml-highlight example * yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython * yaml/pyyaml#329 -- Fix for Python 3.10 * yaml/pyyaml#310 -- increase size of index, line, and column fields * yaml/pyyaml#260 -- remove some unused imports * yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such * yaml/pyyaml#363 -- Add tests for timezone 5.2 (2019-12-02) ------------------ * Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default yaml/pyyaml#279 -- A more flexible fix for custom tag constructors yaml/pyyaml#287 -- Change default loader for yaml.add_constructor yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver * Make FullLoader safer by removing python/object/apply from the default FullLoader yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor * Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff yaml/pyyaml#276 -- Fix logic for quoting special characters * Other PRs: yaml/pyyaml#280 -- Update CHANGES for 5.1
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7 5.3.1 (2020-03-18) * yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor 5.3 (2020-01-06) * yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None` * yaml/pyyaml#270 -- fix typos and stylistic nit * yaml/pyyaml#309 -- Fix up small typo * yaml/pyyaml#161 -- Fix handling of __slots__ * yaml/pyyaml#358 -- Allow calling add_multi_constructor with None * yaml/pyyaml#285 -- Add use of safe_load() function in README * yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF * yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff * yaml/pyyaml#359 -- Use full_load in yaml-highlight example * yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython * yaml/pyyaml#329 -- Fix for Python 3.10 * yaml/pyyaml#310 -- increase size of index, line, and column fields * yaml/pyyaml#260 -- remove some unused imports * yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such * yaml/pyyaml#363 -- Add tests for timezone 5.2 (2019-12-02) ------------------ * Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default yaml/pyyaml#279 -- A more flexible fix for custom tag constructors yaml/pyyaml#287 -- Change default loader for yaml.add_constructor yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver * Make FullLoader safer by removing python/object/apply from the default FullLoader yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor * Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff yaml/pyyaml#276 -- Fix logic for quoting special characters * Other PRs: yaml/pyyaml#280 -- Update CHANGES for 5.1
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7 5.3.1 (2020-03-18) * yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor 5.3 (2020-01-06) * yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None` * yaml/pyyaml#270 -- fix typos and stylistic nit * yaml/pyyaml#309 -- Fix up small typo * yaml/pyyaml#161 -- Fix handling of __slots__ * yaml/pyyaml#358 -- Allow calling add_multi_constructor with None * yaml/pyyaml#285 -- Add use of safe_load() function in README * yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF * yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff * yaml/pyyaml#359 -- Use full_load in yaml-highlight example * yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython * yaml/pyyaml#329 -- Fix for Python 3.10 * yaml/pyyaml#310 -- increase size of index, line, and column fields * yaml/pyyaml#260 -- remove some unused imports * yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such * yaml/pyyaml#363 -- Add tests for timezone 5.2 (2019-12-02) ------------------ * Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default yaml/pyyaml#279 -- A more flexible fix for custom tag constructors yaml/pyyaml#287 -- Change default loader for yaml.add_constructor yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver * Make FullLoader safer by removing python/object/apply from the default FullLoader yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor * Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff yaml/pyyaml#276 -- Fix logic for quoting special characters * Other PRs: yaml/pyyaml#280 -- Update CHANGES for 5.1
Fixes #350
This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS).
When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example:
In PyYAML, the reader uses a function
check_printableto validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable.It would be fairly natural to write a regular expression that captures strings that contain only printable characters, as opposed to non-printable characters (as identified by the old code, so not excluding surrogate pairs):
Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low (
[\uD800-\uDBFF][\uDC00-\uDFFF]):Then, this regex could be used as follows:
However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match).
Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points:
Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high):
(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low):
[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.