Fix reader for Unicode code points over 0xFFFF by anishathalye · Pull Request #351 · yaml/pyyaml

anishathalye · 2019-11-20T20:02:13Z

Fixes #350

This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example:

[i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function check_printable to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable.

It would be fairly natural to write a regular expression that captures strings that contain only printable characters, as opposed to non-printable characters (as identified by the old code, so not excluding surrogate pairs):

PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low ([\uD800-\uDBFF][\uDC00-\uDFFF]):

PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

def check_printable(self, data):
    if not self.PRINTABLE.match(data):
        raise ReaderError(...)

However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points:

Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high):

(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]
Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low):

[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.

This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS). When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example: [i for i in u'\U0001f3d4'] In PyYAML, the reader uses a function `check_printable` to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable. It would be fairly natural to write a regular expression that captures strings that contain only *printable* characters, as opposed to *non-printable* characters (as identified by the old code, so not excluding surrogate pairs): PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$') Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low (`[\uD800-\uDBFF][\uDC00-\uDFFF]`): PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$') Then, this regex could be used as follows: def check_printable(self, data): if not self.PRINTABLE.match(data): raise ReaderError(...) However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match). Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points: - Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high): (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF] - Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low): [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$) The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.

perlpunk · 2019-11-20T20:35:21Z

Thanks!
I think I can start with enabling the tests we explicitly disabled for this and see if Appveyor tests suceed.
But I will be quite busy until the rest of the week.

anishathalye · 2019-11-20T22:06:53Z

Oh yeah, I was going to do that and see what happened, but I forgot. Enabling those tests as part of this PR would make sense.

The tests can be enabled as follows:

diff --git i/tests/lib/test_appliance.py w/tests/lib/test_appliance.py
index 5ec4575..7822979 100644
--- i/tests/lib/test_appliance.py
+++ w/tests/lib/test_appliance.py
@@ -27,8 +27,6 @@ def find_test_filenames(directory):
             base, ext = os.path.splitext(filename)
             if base.endswith('-py3'):
                 continue
-            if not has_ucs4 and base.find('-ucs4-') > -1:
-                continue
             filenames.setdefault(base, []).append(ext)
     filenames = filenames.items()
     filenames.sort()

(it might make sense to rename those *-ucs4-* files too.)

For comparison, here's what happens with the UCS-4 tests enabled for a non-UCS-4 Python 2 build, prior to this patch:

running test
running build_py
running build_ext
building '_yaml' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ext/_yaml.c -o build/temp.macosx-10.14-x86_64-2.7/ext/_yaml.o
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E.................................................................................................................................................................EE.......EE...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................EE.......
===========================================================================
test_representer_types(tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_representer.py", line 15, in test_representer_types
    native2 = yaml.load(output, Loader=test_constructor.MyLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 81, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0
---------------------------------------------------------------------------
tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code:
u"\udd00"
===========================================================================
test_unicode_input(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_input(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
test_unicode_transfer(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 123, in test_unicode_transfer
    output1 = yaml.emit(yaml.parse(input), allow_unicode=True)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 209, in emit
    for event in events:
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 73, in parse
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 76, in __init__
    self.check_printable(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<unicode string>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_transfer(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 123, in test_unicode_transfer
    output1 = yaml.emit(yaml.parse(input), allow_unicode=True)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 209, in emit
    for event in events:
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 73, in parse
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 76, in __init__
    self.check_printable(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<unicode string>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
test_unicode_input_ext(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_yaml_ext.py", line 240, in wrapper
    function(*args, **kwds)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "tests/lib/test_yaml_ext.py", line 30, in new_load
    return old_load(stream, Loader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_input_ext(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_yaml_ext.py", line 240, in wrapper
    function(*args, **kwds)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "tests/lib/test_yaml_ext.py", line 30, in new_load
    return old_load(stream, Loader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
TESTS: 2590
ERRORS: 7

Here's what happens with this patch:

running test
running build_py
copying lib/yaml/reader.py -> build/lib.macosx-10.14-x86_64-2.7/yaml
running build_ext
building '_yaml' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ext/_yaml.c -o build/temp.macosx-10.14-x86_64-2.7/ext/_yaml.o
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
===========================================================================
test_representer_types(tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_representer.py", line 15, in test_representer_types
    native2 = yaml.load(output, Loader=test_constructor.MyLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 81, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0
---------------------------------------------------------------------------
tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code:
u"\udd00"
===========================================================================
TESTS: 2590
ERRORS: 1

The single test that's still failing involves encoding/decoding \udd00. I'm not sure what the expected behavior is here. Given that this is an invalid surrogate, shouldn't the serializer refuse to serialize it (or does the spec say it's undefined behavior, or something)?

perlpunk · 2019-11-25T22:53:50Z

We decided to get out a 5.2 as soon as possible and not add more PRs to it, but I hope we can do 5.3 soon after that and include this PR!

So far I haven't found out what the emitting-unacceptable-unicode-character-bug-ucs4-py2 test is doing. The commit log references a ticket but for the old ticket system...

anishathalye · 2019-11-26T14:43:24Z

Sounds good! Hopefully this can be included in 5.2.1, as it's more of a bugfix than a new feature?

Looks like that file was originally introduced in this commit: https://bitbucket.org/xi/pyyaml/commits/6c7c12bb8fe64828e400a5a6c0fb5dd9911392c3

It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/

perlpunk · 2019-11-26T14:58:05Z

It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/

Thanks for going through the commits, I hadn't saved the commit id after I searched for this :)

Actually it references this one, as bitbucket was introduced later:
https://web.archive.org/web/20100613112554/http://pyyaml.org/ticket/11

anishathalye · 2019-11-27T19:28:11Z

Ah that makes sense! The referenced issue on Bitbucket didn't really seem to match the commit

perlpunk · 2019-12-08T16:18:56Z

Thanks! merged to release/5.3
I will rename the tests (except the one we're not sure about)

After #351 the tests are working again

They were disabled in d6cbff6 After #351 the tests are working again

This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS). When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example: [i for i in u'\U0001f3d4'] In PyYAML, the reader uses a function `check_printable` to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable. It would be fairly natural to write a regular expression that captures strings that contain only *printable* characters, as opposed to *non-printable* characters (as identified by the old code, so not excluding surrogate pairs): PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$') Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low (`[\uD800-\uDBFF][\uDC00-\uDFFF]`): PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$') Then, this regex could be used as follows: def check_printable(self, data): if not self.PRINTABLE.match(data): raise ReaderError(...) However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match). Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points: - Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high): (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF] - Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low): [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$) The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.

They were disabled in d6cbff6 After #351 the tests are working again

perlpunk · 2020-01-06T20:25:34Z

released https://pypi.org/project/PyYAML/5.3/

https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7 5.3.1 (2020-03-18) * yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor 5.3 (2020-01-06) * yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None` * yaml/pyyaml#270 -- fix typos and stylistic nit * yaml/pyyaml#309 -- Fix up small typo * yaml/pyyaml#161 -- Fix handling of __slots__ * yaml/pyyaml#358 -- Allow calling add_multi_constructor with None * yaml/pyyaml#285 -- Add use of safe_load() function in README * yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF * yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff * yaml/pyyaml#359 -- Use full_load in yaml-highlight example * yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython * yaml/pyyaml#329 -- Fix for Python 3.10 * yaml/pyyaml#310 -- increase size of index, line, and column fields * yaml/pyyaml#260 -- remove some unused imports * yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such * yaml/pyyaml#363 -- Add tests for timezone 5.2 (2019-12-02) ------------------ * Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default yaml/pyyaml#279 -- A more flexible fix for custom tag constructors yaml/pyyaml#287 -- Change default loader for yaml.add_constructor yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver * Make FullLoader safer by removing python/object/apply from the default FullLoader yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor * Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff yaml/pyyaml#276 -- Fix logic for quoting special characters * Other PRs: yaml/pyyaml#280 -- Update CHANGES for 5.1

This was referenced Nov 20, 2019

Support for certain unicode literals depends on sys.maxunicode #350

Closed

Support extended character sets in install.conf.yaml 🐈 anishathalye/dotbot#205

Closed

perlpunk changed the base branch from master to release/5.3 December 8, 2019 16:17

perlpunk merged commit 474dfc4 into yaml:release/5.3 Dec 8, 2019

perlpunk added a commit that referenced this pull request Dec 8, 2019

Enable certain unicode tests when maxunicode not > 0xffff

9270f05

After #351 the tests are working again

perlpunk added a commit that referenced this pull request Dec 8, 2019

Enable certain unicode tests when maxunicode not > 0xffff

bfabdf4

They were disabled in d6cbff6 After #351 the tests are working again

perlpunk mentioned this pull request Dec 8, 2019

Enable certain unicode tests when maxunicode not > 0xffff #360

Merged

perlpunk added a commit that referenced this pull request Dec 20, 2019

Enable certain unicode tests when maxunicode not > 0xffff

a826f54

They were disabled in d6cbff6 After #351 the tests are working again

pekkaklarck mentioned this pull request Jan 7, 2020

PyYAML 5.3 not compatible with Jython #369

Closed

asherf mentioned this pull request Apr 29, 2020

Upgrade PyYAML 5.1.2 to 5.3.1 pantsbuild/pants#9649

Merged

mercywediah mentioned this pull request Mar 26, 2021

Bump pyyaml from 3.13 to 5.4 mawy7/gra-asokwa-tincenter#5

Open

sync-by-unito bot mentioned this pull request Apr 12, 2021

Bump pyyaml from 5.1.2 to 5.4 lwinterface/panini_microservice_template#133

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reader for Unicode code points over 0xFFFF#351

Fix reader for Unicode code points over 0xFFFF#351
perlpunk merged 1 commit intoyaml:release/5.3from
anishathalye:fix-unicode-non-bmp-ucs-2

anishathalye commented Nov 20, 2019 •

edited

Loading

Uh oh!

perlpunk commented Nov 20, 2019

Uh oh!

anishathalye commented Nov 20, 2019

Uh oh!

perlpunk commented Nov 25, 2019

Uh oh!

anishathalye commented Nov 26, 2019 •

edited

Loading

Uh oh!

perlpunk commented Nov 26, 2019

Uh oh!

anishathalye commented Nov 27, 2019

Uh oh!

perlpunk commented Dec 8, 2019

Uh oh!

perlpunk commented Jan 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anishathalye commented Nov 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

perlpunk commented Nov 20, 2019

Uh oh!

anishathalye commented Nov 20, 2019

Uh oh!

perlpunk commented Nov 25, 2019

Uh oh!

anishathalye commented Nov 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

perlpunk commented Nov 26, 2019

Uh oh!

anishathalye commented Nov 27, 2019

Uh oh!

perlpunk commented Dec 8, 2019

Uh oh!

perlpunk commented Jan 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

anishathalye commented Nov 20, 2019 •

edited

Loading

anishathalye commented Nov 26, 2019 •

edited

Loading