
curl_multibyte: Resurrect local encoding fallbacks#7421

Closed
jay wants to merge 1 commit intocurl:masterfrom
jay:resurrect_local_fallback

Conversation

@jay (Member) commented Jul 17, 2021

This is an undo of 765e060 which removed local encoding fallbacks in
Windows Unicode builds.

Briefly, Windows Unicode builds are expected to use UTF-8 for file paths
internally; however, the user may pass paths in the current locale. Though
that is not expected for paths that are opened internally, it may be
necessary for paths passed to dependencies that expect current-locale
encoding. It is not documented which paths are expected to be passed in
UTF-8 encoding in Windows Unicode builds.

Prior to 765e060 the behavior was to convert the passed-in string from
UTF-8 to UTF-16 and then pass that to the respective wide-character file
function (e.g. _wfopen). If the conversion or the file function failed,
then as a fallback it would try the passed-in string with the regular
file function (e.g. fopen).

In comparison, the behavior in this change is to check explicitly whether
the string is valid UTF-8 before the conversion, and only if it is not,
fall back on the respective file function that uses the current locale.

The main difference is that the file function failing no longer causes a
fallback to occur. That removes what could be considered a slight security
or stability issue, since the file function could fail for any reason,
such as the file not existing.

Comments I made in 765e060 describing a security issue are incorrect.
Regardless of any fallback, it is possible for the user to pass a string
in the current locale that is also valid UTF-8. The security issue IMO
is that it is not documented or well known which strings should be in
which encoding, except that as developers we know UTF-8 is used for file
paths internally (i.e. files opened by curl/libcurl) in Windows Unicode
builds.

Closes #xxxx

@jay jay added the Windows Windows-specific label Jul 17, 2021
@vszakats (Member)

Reading into the details, I'm confused. The patch does not restore the dynamic 8-bit fallback, which I agree with. (Coming from the comment, I first assumed it would be restored.)

But, what is the scenario in which curlx_convert_UTF8_to_wchar() (which is effectively MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...)) would succeed but curlx_is_str_utf8() would fail, or vice versa? What is the benefit of pre-validating the UTF-8 string with a custom local function before passing it to Windows to convert it to UTF-16? Even if these two disagreed, why not just rely on Windows here?

@jay (Member, Author) commented Jul 19, 2021

What is the dynamic fallback? I may have missed that.

The reason I wrote a separate function to check UTF-8 is that it's more accurate to have a function that returns an error only on bad UTF-8. Using MultiByteToWideChar to check could fail for other reasons, though I'd guess that's pretty unlikely.
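For reference, a minimal portable sketch of such a strict check. The name is_utf8 and the exact rejection rules here are illustrative, not necessarily what the patch's curlx_is_str_utf8() implements:

```c
#include <stdbool.h>

/* Return true only if 's' is well-formed UTF-8: rejects stray
   continuation bytes, overlong encodings, UTF-16 surrogates and
   code points above U+10FFFF. */
static bool is_utf8(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  while(*p) {
    unsigned char c = *p;
    int len, i;
    if(c < 0x80)
      len = 1;                                /* ASCII */
    else if((c & 0xE0) == 0xC0 && c >= 0xC2)
      len = 2;                                /* 0xC0/0xC1 are overlong */
    else if((c & 0xF0) == 0xE0)
      len = 3;
    else if((c & 0xF8) == 0xF0 && c <= 0xF4)
      len = 4;                                /* above 0xF4 exceeds U+10FFFF */
    else
      return false;                           /* e.g. a stray continuation byte */
    for(i = 1; i < len; i++)
      if((p[i] & 0xC0) != 0x80)
        return false;                         /* missing continuation byte */
    if(len == 3 && ((c == 0xE0 && p[1] < 0xA0) ||   /* overlong 3-byte form */
                    (c == 0xED && p[1] >= 0xA0)))   /* UTF-16 surrogate */
      return false;
    if(len == 4 && ((c == 0xF0 && p[1] < 0x90) ||   /* overlong 4-byte form */
                    (c == 0xF4 && p[1] >= 0x90)))   /* above U+10FFFF */
      return false;
    p += len;
  }
  return true;
}
```

Unlike a MultiByteToWideChar round trip, a check like this can only fail for one reason: the input is not valid UTF-8.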

@vszakats (Member)

By "dynamic fallback" I meant the original (now removed in master) logic that would retry file operations with a different locale if they failed.

The Windows API docs say that MultiByteToWideChar() returns ERROR_NO_UNICODE_TRANSLATION for invalid input, so catching this seems to cover it. None of the other error codes (ERROR_INSUFFICIENT_BUFFER, ERROR_INVALID_FLAGS, ERROR_INVALID_PARAMETER) should ever happen in the second/final call. (Unless the Windows environment lacks UTF-8 support, but I don't believe that can happen for a binary built against the UNICODE Windows API.)

@jay (Member, Author) commented Jul 19, 2021

By "dynamic fallback" I meant the original (now removed in master) logic that would retry file operations with a different locale if they failed.

It only falls back now if the string is not valid UTF-8. The original behavior of retrying valid UTF-8 in the current locale could lead to libcurl reading or writing an unintended file, so this is not a full revert. Conversely, it is also true that a string in the current locale could be valid UTF-8, and that could lead to reading or writing an unintended file. As discussed elsewhere, the documentation will need to be improved to address this.

#ifdef _UNICODE
  if(curlx_is_str_utf8(filename)) {
    /* convert the UTF-8 string to UTF-16 and use the wide-character call */
    wchar_t *filename_w = curlx_convert_UTF8_to_wchar(filename);
    ...
    return _wfopen(filename_w, mode_w);
  }
  else
#endif
    return fopen(filename, mode);

@vszakats (Member) commented Jul 19, 2021

Oops, so my fear was right. I think this is a mistake: libraries should not try to detect and decide which codepage the caller means. As I wrote elsewhere, invalid UTF-8 may still have been intended as UTF-8 and thus should fail when used; also, valid UTF-8 may well be a valid 8-bit string as well. Meaning it is not possible to auto-detect UTF-8 vs. 8-bit codepages, and even less so the original intent of the caller.
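To make the ambiguity concrete: the byte pair 0xC3 0xA9 is a complete, valid UTF-8 sequence for U+00E9 ('é'), while in Windows-1252 the very same bytes spell the two characters 'Ã' and '©'. A sketch (the hand-written decoders below are illustrative; in the 0xA0 to 0xFF range, Windows-1252 code points happen to equal the byte values):

```c
/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a
   Unicode code point. Assumes the sequence is already validated. */
static unsigned decode_utf8_2(const unsigned char *b)
{
  return ((b[0] & 0x1Fu) << 6) | (b[1] & 0x3Fu);
}

/* Decode one Windows-1252 byte in the 0xA0-0xFF range; there the
   codepage maps each byte to the code point of the same value. */
static unsigned decode_cp1252_high(unsigned char b)
{
  return (unsigned)b;   /* valid only for 0xA0 <= b <= 0xFF */
}
```

Both decodings are perfectly valid, so no validity check can recover which encoding the caller intended.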

I'm very strongly against committing such a change. It takes away control from the user, and the good intention will result in unexpected behaviour or, in the worst case, a footgun.

If the goal is to maintain support for both UTF-8 and 8-bit codepages using the exact same API, the transparent and correct way of dealing with this is to let the user specify whether "ANSI"/UNICODE should be passed/returned for non-ASCII libcurl strings. In UNICODE builds this can have a default that makes calls backwards compatible (accepting/returning 8-bit "ANSI"), while allowing UTF-8 to be passed/returned when explicitly enabled.

Though I'd argue that compatibility has already been broken a while ago when UNICODE builds became possible without paying attention to this, so making the default encoding UTF-8 in these builds is probably the cleaner solution.

@jay (Member, Author) commented Jul 19, 2021

As I wrote elsewhere, invalid UTF-8 may still have been intended as UTF-8 and thus should fail when used; also, valid UTF-8 may well be a valid 8-bit string as well.

It's highly unlikely that invalid UTF-8 is intended as UTF-8, and highly likely that it's actually in the current locale, given that it isn't documented in the options (and which options?) that UTF-8 is used for some strings (and which strings?) in Windows Unicode builds. There is too much I don't know about this, and I think not having this fallback could break things (and which things?) for users passing non-ASCII strings in those builds, which I again would assume are in the current locale.

@vszakats (Member) commented Jul 19, 2021

I believe it is especially dangerous to introduce such a feature when we don't understand the full extent of its consequences.

In the end it will make the system behave unexpectedly in every scenario. It's not a good result if libcurl misdetects a filename as UTF-8, tries to open it and fails, leaving libcurl without a CA store, pin or public/private key (or a dozen other similar security-related external files). While this may be unlikely (how unlikely? we don't know), it can be even worse: the user can end up with a different CA store/key/pin, making this exploitable. (Speaking of non-ASCII URLs, the situation is similarly bad, but maybe easier to grasp: who would want a libcurl heuristic to decide which URL to hit over the network?)

Given the undocumented past of string encoding/codepage requirements and the lack of many (or any) bug reports about this, I just don't see what such an auto-detection effort is trying to resolve, exactly. What is easier to see is that the proposed solution is heuristic and risky.

Since this area is/was undocumented, I believe the correct path is bringing it to a well-defined shape and document it. And in a well-defined shape the best is to leave out auto-detection and similar fallback heuristics.

@jay (Member, Author) commented Jul 19, 2021

I disagree. We already had this behavior. It's not ideal, but not putting it in the upcoming release is sure to break some things. Which things, I don't know, but unless someone else is in support of this we're going to find out.

@vszakats (Member)

I think the best solution will be to stop creating UNICODE builds on Windows. They were broken from the start and with changes like this one, the situation will be even worse.

vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Unicode support in libcurl does not
seem to be ready for production. Existing support extended certain Windows
interfaces to use the Unicode flavour of the Windows API, but that also
meant that the expected encoding/codepage of strings (e.g. local filenames,
URLs) exchanged via the libcurl API became ambiguous and undefined.
Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API/option, certain dynamic and static "fallback" logic inside
libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit
strings internally. From the user's perspective this poses an unreasonably
difficult task in finding out how to pass a certain non-ASCII string to a
specific API without unwanted or accidental (possibly lossy) conversions or
other side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, finding different files,
accessing the wrong URL or passing a corrupt username or password.

Note that these issues may _only_ affect strings with _non-ASCII_ content.

For now the best solution seems to be to revert back to how libcurl/curl
worked for most of its existence and only re-enable Unicode once the
remaining parts of Windows Unicode support are well-understood, ironed out
and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Windows Unicode support in libcurl does
not seem to be ready for production. Existing support extended certain
Windows interfaces to use the Unicode flavour of the Windows API, but that
also meant that the expected encoding/codepage of strings (e.g. local
filenames, URLs) exchanged via the libcurl API became ambiguous and
undefined.

Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API, build options/dependencies, internal fallback logic based on
runtime auto-detection of the passed string, and the result of file
operations (scheduled for removal in 7.78.0), while some parts of libcurl
kept using 8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in
finding out how to pass (or read) a certain non-ASCII string to (from) a
specific API without unwanted or accidental conversions or other
side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, reading/writing a
different file, accessing the wrong URL or passing a corrupt username or
password.

Note that these issues may only affect strings with _non-7-bit-ASCII_
content.

For now the least bad solution seems to be to revert back to how
libcurl/curl worked for most of its existence and only re-enable Unicode
once the remaining parts of Windows Unicode support are well-understood,
ironed out and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
@jay jay added the on-hold label Jul 20, 2021
@jay (Member, Author) commented Sep 24, 2021

Closing because there have been no problem reports since 7.78 was released without the fallbacks.

@jay jay closed this Sep 24, 2021
@vszakats vszakats added the Unicode Unicode, code page, character encoding label Feb 16, 2023