New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089
Comments
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode).
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to 0, to avoid undefined value.
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
|
I added PyUnicode_AsUTF8() to the limited C API and it now raises an exception if the string contains embedded null characters. |
|
This makes |
If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.
Would you mind to elaborate?
It's a different API. In C, In C, it's rare that C code uses a length with a |
OK! Will you do that?
The API that existed with this name since before Python 3.0 is -- as you claim -- unsafe. The new semantics are not proven; usually we wait before the API stabilizes before declaring it stable for the long term. |
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.
Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.
Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.
Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
Fix unicode_subtype_new
|
While I understand the rationale behind this and wish the API had been designed that way when PyUnicode was AsUTF8 was created... It is too late. Changing Thus if we were to make this change, it should require a PEP 387 deprecation cycle - With the changed C APIs setting a warning (assuming no performance issues as with the current O(n) are introduced), and only making the behavior change to be an exception a few releases later. Setting warnings from within a C API is a bit weird, is that even going to be useful for people seeing them in tests to debug where the problem came from? Improve the documentation to advise people to use a length and validate the length before crossing over into C-string land anywhere security might matter is sufficient. No compelling serious reason to change these APIs out from underneath existing working code has been presented. So I'm going with "lets just not do this". There is still an API improvement we could make to make peoples safety minded lives easier without sacrificing performance: A new API could be added that simply returns a 0 or 1 if a PyUnicode object contains a null character (known at creation time via #111587 like mechanisms) for code that has reason to care on 3.13+ to use rather than calling an expensive Other reasons not to do this: It breaks symmetry with To be clear: This is me saying the changes made so far for this issue should be reverted. |
…1585) Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters.
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.
|
Looking at precedence, So yet one alternative is to add the check in The check for the embedded NULs was in I do not insist on this way, I only show yet one alternative. |
I would not be comfortable with emitting a DeprecationWarning. For me, we should change the API without deprecation period: either we make "the default" safe (reject null embedded characters), or we leave the default unchanged and we provide a new "safe" API. Here "the default" are Also @Yhg1s seem to like using PyUnicode_AsUTF8() to truncate strings with embedded null characters on purpose. It doesn't want the "default" to change. I'm not sure that a deprecation would help.
I proposed PyUnicode_AsUTF8Safe() instead in my new PR #111688. PyUnicode_ContainsNul() requires to raise an exception explicitly: it requires more lines of code, and so I'm afraid that people will not adopt such API and so write unsafe code because of lazyness. Moreover, I'm not sure about adding a specialized PyUnicode_FindChar() function just for the NUL character: the existing API can be used. If tomorrow an optimization is needed, PyUnicode_FindChar(0) should also use it. About performance, the correctness of my proposed PR #111587 ("embed_null" cache) is under discussions. Since the C API makes it technically possible to mutate string, and people well, actually use anything which is possible by the C API, it's hard to guarantee correctness and "prefill" the cache when a string is created. Moreover, the performance gain is also being discussed: so far, I didn't manage to convince @methane and @serhiy-storchaka on that part :-)
A bytes object is different: it's just an array of bytes (0..255 integers). PyUnicode_AsUTF8() is a complex operation: it encodes Unicode code points (characters) to UTF-8. The UTF-8 encoded string is different and longer if the string contains non-ASCII characters. For bytes, Moreover, I prefer to use Apparently, Well, as I wrote, I proposed PR #111688 to revert |
From what I understood, @gpshead and @Yhg1s concerns are more about for how long people rely on the exact C API. Since I think that any problem which was not explained previously is that a "string" itself has no specific purpose, it's very generic. The Perl programming language has a "tained string" feature to make a string as "security sensitive". But in Python, a string can be "anything": a person name, an URL, a filename, XML, well, whatever. The problem is to decide in the broad general case if a string is "security sensitive" or not. I proposed to make the assumption that all strings are security sensitive. @Yhg1s argues that no, in his use case, it's not the case and having to "pay a performance tax" of additional checks is not acceptable. Python was already fixed to reject null characters when it's obvious that the string is security sensitive: for filenames. Basically all filename operations reject null characters. Apparently, PyUnicode_AsUTF8() is "too broad" or "too generic" to be treated as "security sensitive". I'm not sure why The fact that |
In this case, why is the issue tagged "feature" instead of being sent to the security response team and handled under embargo? It's clearly not a real security issue, and so there is no excuse for breaking the function without a normal deprecation period. If you're not comfortable with deprecation, then don't break the function. |
|
In a similar issue with the Py_UNICODE cache, I added new private function |
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.
…ython#111585)" This reverts commit d9b606b.
python#111091)" This reverts commit d731579.
…ython#111121)" This reverts commit d8f32be.
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)" This reverts commit d9b606b. * Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)" This reverts commit cde1071. * Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)" This reverts commit d731579. * Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)" This reverts commit d8f32be. * Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)" This reverts commit 37e4e20.
|
There seems to be multiple disagreements on multiple things around this issue:
When I made these changes, they were approved by my peers and I thought that we were all on the same page on embedded null characters, since Python is already rejecting them in many places such as PyUnicode_AsWideCharString() for many years. But well, then read the discussions for details about disagreements. I reverted all changes (except of PR #111106 which is unrelated) in PR #111833. I close the issue for now. |
|
For reference I did attempt a test run of our huge "everything" style codebase at work over the weekend with That various things depend on the current (and now restored, thank you!) behavior, regardless of what any of us think of that, was what I wanted it to show. It did. :) |
Any idea what is their use case for embedded null characters? How are such strings build? By decoding bytes from UTF-8? |
|
Thanks for that feedback/test, it's very useful 👍 |
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
|
Thank you for the revert. Sorry that the issues were only discovered after merge, and after such heated discussion. Maybe there's some process improvement to be made. I filed this issue for the C-API-WG-to-be: capi-workgroup/api-evolution#39 |
I propose to change the
PyUnicode_AsUTF8()API to raise an exception and return NULL if the string contains embedded null characters.If the string contains an embedded null character, the UTF-8 encoded string can be truncated if used with C functions using
char*since a null byte is treated as the terminator: marker of the string end. Truncating a string silently is a bad practice and can lead to different bugs including security vulnerabilities.In practice, the minority of impacted C extensions and impacted users should benefit of such backward incompatible change, since truncating a string silently is a bad practice. Impacted users can use
PyUnicode_AsUTF8AndSize(obj, NULL)and just ignore the size if they want to truncate on purpose.It would address the following "hidden" comment on PyUnicode_AsUTF8():
PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.
In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.
The folllowing functions already raise an exception if the string contains embedded null characters or bytes:
PyUnicode_AsUTF8String() returns a bytes object and so the length, so it doesn't raise the exception.
PyUnicode_AsUTF8AndSize() also returns the size and so don't raise on embedded null characters.
Linked PRs
The text was updated successfully, but these errors were encountered: