Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed
vstinner opened this issue Oct 19, 2023 · 22 comments
Closed
Assignees
Labels
topic-C-API type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented Oct 19, 2023

I propose to change the PyUnicode_AsUTF8() API to raise an exception and return NULL if the string contains embedded null characters.

If the string contains an embedded null character, the UTF-8 encoded string can be truncated if used with C functions using char* since a null byte is treated as the terminator: marker of the string end. Truncating a string silently is a bad practice and can lead to different bugs including security vulnerabilities.

In practice, the minority of impacted C extensions and impacted users should benefit of such backward incompatible change, since truncating a string silently is a bad practice. Impacted users can use PyUnicode_AsUTF8AndSize(obj, NULL) and just ignore the size if they want to truncate on purpose.

It would address the following "hidden" comment on PyUnicode_AsUTF8():

Use of this API is DEPRECATED since no size information can be
extracted from the returned data.

PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.

In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.

The folllowing functions already raise an exception if the string contains embedded null characters or bytes:

  • PyUnicode_AsWideCharString()
  • PyUnicode_EncodeLocale()
  • PyUnicode_EncodeFSDefault()
  • PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
  • PyUnicode_DecodeFSDefaultAndSize()
  • PyUnicode_FSConverter()
  • PyUnicode_FSDecoder()

PyUnicode_AsUTF8String() returns a bytes object and so the length, so it doesn't raise the exception.

PyUnicode_AsUTF8AndSize() also returns the size and so don't raise on embedded null characters.

Linked PRs

vstinner added a commit to vstinner/cpython that referenced this issue Oct 19, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid
  undefined variable value.
* Update related C API tests (test_capi.test_unicode).
vstinner added a commit to vstinner/cpython that referenced this issue Oct 19, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid
  undefined variable value.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
@vstinner vstinner self-assigned this Oct 19, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to 0,
to avoid undefined value.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
vstinner added a commit that referenced this issue Oct 20, 2023
* PyUnicode_AsUTF8() now raises an exception if the string contains
  embedded null characters.
* Update related C API tests (test_capi.test_unicode).
* type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently
  truncate doc containing null bytes.

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
vstinner added a commit to vstinner/cpython that referenced this issue Oct 20, 2023
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
vstinner added a commit that referenced this issue Oct 20, 2023
Add PyUnicode_AsUTF8() function to the limited C API.

multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of
PyUnicode_AsUTF8AndSize(): the extension is built with the limited C
API. The function now raises an exception if the filename contains an
embedded null character instead of truncating silently the filename.
vstinner added a commit that referenced this issue Oct 20, 2023
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1,
to avoid undefined value.
vstinner added a commit that referenced this issue Oct 20, 2023
PyUnicode_AsUTF8() now raises an exception if the string contains
embedded null characters.
@vstinner
Copy link
Member Author

I added PyUnicode_AsUTF8() to the limited C API and it now raises an exception if the string contains embedded null characters.

@encukou
Copy link
Member

encukou commented Oct 23, 2023

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string. I don't think that's a change to make lightly.
I also strongly believe that it's not a good fit for the limited API. The better alternative -- PyUnicode_AsUTF8AndSize is already there.

@encukou encukou reopened this Oct 23, 2023
@vstinner
Copy link
Member Author

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.

If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The better alternative -- PyUnicode_AsUTF8AndSize is already there.

It's a different API. In C, char* with NUL terminator is used everywhere. Embedded NUL bytes is truncating strings which caused many bugs in the past.

In C, it's rare that C code uses a length with a char* string.

@encukou
Copy link
Member

encukou commented Oct 28, 2023

This makes PyUnicode_AsUTF8 O(N), as it needs to scan the entire string.
If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.

OK! Will you do that?
What are the performance trade-offs of that solution?

I also strongly believe that it's not a good fit for the limited API.

Would you mind to elaborate?

The API that existed with this name since before Python 3.0 is -- as you claim -- unsafe. The new semantics are not proven; usually we wait before the API stabilizes before declaring it stable for the long term.

vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.

The change avoids to have to include explicitly <string.h> to get the
strlen() function when using a recent version of the limited C API.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
Add PyASCIIObject.state.embed_null member to Python str objects. It
is used as a cache by PyUnicode_AsUTF8() to only check once if a
string contains a null character. Strings created by
PyUnicode_FromString() initializes *embed_null* since the string
cannot contain a null character.

Global static strings now also initialize the *embed_null* member.
The chr(0) singleton ("\0" string) is the only static string which
contains a null character.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 1, 2023
@gpshead
Copy link
Member

gpshead commented Nov 3, 2023

While I understand the rationale behind this and wish the API had been designed that way when PyUnicode was AsUTF8 was created... It is too late.

Changing AsUTF8 (or AsUTF8AndSize(str, NULL) as your draft PR does!) to return NULL with an exception set in situations where they did not in the past breaks our existing API contract.

Thus if we were to make this change, it should require a PEP 387 deprecation cycle - With the changed C APIs setting a warning (assuming no performance issues as with the current O(n) are introduced), and only making the behavior change to be an exception a few releases later. Setting warnings from within a C API is a bit weird, is that even going to be useful for people seeing them in tests to debug where the problem came from?

Improve the documentation to advise people to use a length and validate the length before crossing over into C-string land anywhere security might matter is sufficient. No compelling serious reason to change these APIs out from underneath existing working code has been presented. So I'm going with "lets just not do this".

There is still an API improvement we could make to make peoples safety minded lives easier without sacrificing performance: A new API could be added that simply returns a 0 or 1 if a PyUnicode object contains a null character (known at creation time via #111587 like mechanisms) for code that has reason to care on 3.13+ to use rather than calling an expensive strlen() to compare against the returned size as is required today. int PyUnicode_ContainsNul(PyObject*) would effectively be a cached precomputed version of its equivalent more complicated and costly PyUnicode_FindChar() call.

Other reasons not to do this: It breaks symmetry with PyBytes_FromString() which is what the API is trying to mimic for historical reasons. bytes can contain NUL characters. But we're never going to change that API to fail.

To be clear: This is me saying the changes made so far for this issue should be reverted.

FullteaR pushed a commit to FullteaR/cpython that referenced this issue Nov 3, 2023
…1585)

Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.

The change avoids to have to include explicitly <string.h> to get the
strlen() function when using a recent version of the limited C API.
FullteaR pushed a commit to FullteaR/cpython that referenced this issue Nov 3, 2023
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove
the explicit check for embedded null characters.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 3, 2023
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
@serhiy-storchaka
Copy link
Member

Looking at precedence, PyBytes_AsStringAndSize() and PyUnicode_AsWideCharString() check for the embedded NULs when the output size pointer is NULL, but PyBytes_AsString() and PyUnicode_AsWideChar() do not.

So yet one alternative is to add the check in PyUnicode_AsUTF8AndSize() and remove it from PyUnicode_AsUTF8(). It keeps PyUnicode_AsUTF8() as a fast accessor to the internal UTF-8 cache and adds an explicit way to check for embedded NULs. PyUnicode_AsUTF8AndSize(u, NULL) is not what somebody would write intentionally if they can use PyUnicode_AsUTF8(), so it will affect less code.

The check for the embedded NULs was in PyBytes_AsStringAndSize() from the beginning (see d1ba443), and for PyUnicode_AsWideCharString() it was added in bpo-30708/#74893.

I do not insist on this way, I only show yet one alternative.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

@gpshead:

Thus if we were to make this change, it should require a PEP 387 deprecation cycle - With the changed C APIs setting a warning (assuming no performance issues as with the current O(n) are introduced), and only making the behavior change to be an exception a few releases later.

I would not be comfortable with emitting a DeprecationWarning. For me, we should change the API without deprecation period: either we make "the default" safe (reject null embedded characters), or we leave the default unchanged and we provide a new "safe" API. Here "the default" are PyUnicode_AsUTF8(unicode) and PyUnicode_AsUTF8AndSize(unicode, NULL) functions. I'm afraid that a deprecation period would make the situation worse for everybody, I'm not convinced that it will make the situation better.

Also @Yhg1s seem to like using PyUnicode_AsUTF8() to truncate strings with embedded null characters on purpose. It doesn't want the "default" to change. I'm not sure that a deprecation would help.

int PyUnicode_ContainsNul(PyObject*) would effectively be a cached precomputed version of its equivalent more complicated and costly PyUnicode_FindChar() call.

I proposed PyUnicode_AsUTF8Safe() instead in my new PR #111688. PyUnicode_ContainsNul() requires to raise an exception explicitly: it requires more lines of code, and so I'm afraid that people will not adopt such API and so write unsafe code because of lazyness. Moreover, I'm not sure about adding a specialized PyUnicode_FindChar() function just for the NUL character: the existing API can be used. If tomorrow an optimization is needed, PyUnicode_FindChar(0) should also use it.

About performance, the correctness of my proposed PR #111587 ("embed_null" cache) is under discussions. Since the C API makes it technically possible to mutate string, and people well, actually use anything which is possible by the C API, it's hard to guarantee correctness and "prefill" the cache when a string is created. Moreover, the performance gain is also being discussed: so far, I didn't manage to convince @methane and @serhiy-storchaka on that part :-)

Other reasons not to do this: It breaks symmetry with PyBytes_FromString() which is what the API is trying to mimic for historical reasons. bytes can contain NUL characters. But we're never going to change that API to fail.

A bytes object is different: it's just an array of bytes (0..255 integers). PyUnicode_AsUTF8() is a complex operation: it encodes Unicode code points (characters) to UTF-8. The UTF-8 encoded string is different and longer if the string contains non-ASCII characters. For bytes, PyBytes_Size() gives you PyBytes_AsString() size, it's an O(1) operation: there is no conversion.

Moreover, PyBytes_AsStringAndSize(bytes, NULL) already raises a ValueError if the string contains null bytes ;-) It's a deliberate feature and many functions use this API to easily reject null bytes. Well, I dislike passing NULL to check for null bytes, I would prefer a separated API for that, but it's a different discussion.

I prefer to use PyUnicode_AsWideChar() example since it also converts a Python str object to a null terminated string: it's just that the format is different, wchar_t* type instead of char* type. PyUnicode_AsWideChar() was modified a long time ago and the change went fine.

Apparently, PyUnicode_AsUTF8() is too fundamental and too specific to be modified in Python 3.13. I understood that it would have make sense to change it in Python 3.0 when Unicode became the default string type, so when PyUnicode_AsUTF8() (previously called PyUnicode_AsString()) was the most natural way to pass a Python string to a C char* function. I understand the "breaking the contract" part. The sails has sailed.

Well, as I wrote, I proposed PR #111688 to revert PyUnicode_AsUTF8() to Python 3.12 behavior and propose a new "safe" API to convert a Python str object to UTF-8: PyUnicode_AsUTF8Safe(). (Let's discuss the proposed name in the PR ;-))

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2023

@serhiy-storchaka:

The check for the embedded NULs was in PyBytes_AsStringAndSize() from the beginning (see d1ba443), and for PyUnicode_AsWideCharString() it was added in bpo-30708/#74893.

From what I understood, @gpshead and @Yhg1s concerns are more about for how long people rely on the exact C API. Since PyUnicode_AsUTF8() is being used since Python 3.0 released in 2008: people expect the same behavior for 15 years, and so basically it's no longer possible to change the behavior.

I think that any problem which was not explained previously is that a "string" itself has no specific purpose, it's very generic. The Perl programming language has a "tained string" feature to make a string as "security sensitive". But in Python, a string can be "anything": a person name, an URL, a filename, XML, well, whatever.

The problem is to decide in the broad general case if a string is "security sensitive" or not. I proposed to make the assumption that all strings are security sensitive. @Yhg1s argues that no, in his use case, it's not the case and having to "pay a performance tax" of additional checks is not acceptable.

Python was already fixed to reject null characters when it's obvious that the string is security sensitive: for filenames. Basically all filename operations reject null characters.

Apparently, PyUnicode_AsUTF8() is "too broad" or "too generic" to be treated as "security sensitive".

I'm not sure why PyUnicode_AsWideCharString() could be treated differently. On Windows, using Windows native API requires to encode Python strings to 16-bit wchar_t* (UTF-16). Maybe rejecting null characters is fine in this case.

The fact that PyBytes_AsStringAndSize(bytes, NULL) reject null bytes means that there are functions which treat bytes strings as security sensitive and they need an API to reject null bytes. Well, as I wrote in my previous comment, I would prefer a separated API (such as PyBytes_AsStringSafe()) for that, but maybe now it's no longer needed since PyBytes_AsStringAndSize(bytes, NULL) already exists and is being used.

@Yhg1s
Copy link
Member

Yhg1s commented Nov 3, 2023

From what I understood, @gpshead and @Yhg1s concerns are more about for how long people rely on the exact C API.

No, how long the function has existed doesn't matter. Just that it exists. It's been in a released Python. People use it. And that's only one of my concerns.

@serhiy-storchaka
Copy link
Member

int PyUnicode_ContainsNul(PyObject*)

See _PyUnicode_HasNULChars(), added in 1334884 (bpo-13848/#58056), removed in 7e9d1d1 (bpo-23908/#68096). It was more convenient to embed the check in functions that return null-terminated strings.

@zooba
Copy link
Member

zooba commented Nov 3, 2023

I would not be comfortable with emitting a DeprecationWarning. For me, we should change the API without deprecation period

In this case, why is the issue tagged "feature" instead of being sent to the security response team and handled under embargo?

It's clearly not a real security issue, and so there is no excuse for breaking the function without a normal deprecation period. If you're not comfortable with deprecation, then don't break the function.

@serhiy-storchaka
Copy link
Member

In a similar issue with the Py_UNICODE cache, I added new private function _PyUnicode_AsUnicode() which was similar to PyUnicode_AsUnicode(), but checked for embedded NULs. It is because PyUnicode_AsUnicode() returned a pointer to a modifiable buffer for legacy strings which could contain uninitialized data. It was similar to PyBytes_AsString().

vstinner added a commit to vstinner/cpython that referenced this issue Nov 4, 2023
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Nov 7, 2023
vstinner added a commit that referenced this issue Nov 7, 2023
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)"

This reverts commit d9b606b.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)"

This reverts commit cde1071.

* Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)"

This reverts commit d731579.

* Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)"

This reverts commit d8f32be.

* Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)"

This reverts commit 37e4e20.
@vstinner
Copy link
Member Author

vstinner commented Nov 7, 2023

There seems to be multiple disagreements on multiple things around this issue:

  • Disagreement if truncating a string on embedded null character is a bug, the expected behavior, or it's just ok we don't care.
  • Disagreement if truncating a string can lead to security issue or not -- see issue Embedded null characters can lead to bugs or even security vulnerabilities #111656 discussion.
  • Disagreement if PyUnicode_AsUTF8() should reject or not embedded null characters.
  • Disagreement if a new function rejecting embedded null characters is needed.
  • Disagreement if anything should be done, since C developers must be aware of security, must be aware of issues with embedded null characters, truncated is not the only security concern so why caring about this one anyway, etc.
  • Disagreement on who take decisions: @encukou is now asking the C API Working Group to take the decision, but the working group doesn't exist yet.

When I made these changes, they were approved by my peers and I thought that we were all on the same page on embedded null characters, since Python is already rejecting them in many places such as PyUnicode_AsWideCharString() for many years. But well, then read the discussions for details about disagreements.

I reverted all changes (except of PR #111106 which is unrelated) in PR #111833.

I close the issue for now.

@vstinner vstinner closed this as completed Nov 7, 2023
@gpshead
Copy link
Member

gpshead commented Nov 8, 2023

For reference I did attempt a test run of our huge "everything" style codebase at work over the weekend with AsUTF8 and AsUTF8AndSize(p, NULL) both set to return an error upon an embedded '\0' character. That test attempt didn't get far (auto-cancelled before it could run more due to the # of failures). Too many things were failing to even build because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least, I didn't try to construct a list and am not going to test further.

That various things depend on the current (and now restored, thank you!) behavior, regardless of what any of us think of that, was what I wanted it to show. It did. :)

@vstinner
Copy link
Member Author

vstinner commented Nov 8, 2023

because several Python tools used at build time triggered the error. https://github.com/eliben/pyelftools/ and https://github.com/fonttools/fonttools/ at least

Any idea what is their use case for embedded null characters? How are such strings build? By decoding bytes from UTF-8?

@vstinner
Copy link
Member Author

vstinner commented Nov 8, 2023

Thanks for that feedback/test, it's very useful 👍

hugovk pushed a commit to hugovk/cpython that referenced this issue Nov 8, 2023
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)"

This reverts commit d9b606b.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)"

This reverts commit cde1071.

* Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)"

This reverts commit d731579.

* Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)"

This reverts commit d8f32be.

* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)"

This reverts commit 37e4e20.
@encukou
Copy link
Member

encukou commented Nov 8, 2023

Thank you for the revert. Sorry that the issues were only discovered after merge, and after such heated discussion. Maybe there's some process improvement to be made.

I filed this issue for the C-API-WG-to-be: capi-workgroup/api-evolution#39
Feel free to re-frame the issue or propose other solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-C-API type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

6 participants