
curl_multibyte: Resurrect local encoding fallbacks#7421

Closed
jay wants to merge 1 commit intocurl:masterfrom
jay:resurrect_local_fallback

Conversation

@jay (Member) commented Jul 17, 2021

This is an undo of 765e060 which removed local encoding fallbacks in
Windows Unicode builds.

Briefly, Windows Unicode builds are expected to use UTF-8 for file paths
internally; however, the user may pass paths in the current locale. Though
that is not expected for paths that are opened internally, it may be
necessary for paths passed to dependencies that expect current-locale
encoding. It is not documented which paths are expected to be passed in
UTF-8 encoding in Windows Unicode builds.

Prior to 765e060 the behavior was to convert the passed-in string from
UTF-8 to UTF-16 and then pass that to the respective wide-character file
function (e.g. _wfopen). If the conversion or the file function failed,
then as a fallback it would try the passed-in string with the regular
file function (e.g. fopen).

In comparison, the behavior in this change is to check explicitly whether
the string is valid UTF-8 before the conversion, and only if it is not,
fall back on the respective file function that uses the current locale.

The main difference is that the file function failing no longer causes a
fallback to occur. That removes what could be considered a slight security
or stability issue, since the file function could fail for any reason,
such as the file not existing.

Comments I made in 765e060 describing a security issue are incorrect.
Regardless of any fallback, it is possible for the user to pass a string
in the current locale that is also valid UTF-8. The security issue IMO
is that it is not documented or well known which strings should be in
which encoding, except that as developers we know UTF-8 is used for file
paths internally (i.e. files opened by curl/libcurl) in Windows Unicode
builds.

Closes #xxxx

@jay jay added the Windows Windows-specific label Jul 17, 2021
@vszakats (Member)

Reading into the details, I'm confused. The patch does not restore the dynamic 8-bit fallback, which I agree with. (Coming from the comment, I first assumed it would be restored.)

But, what is the scenario in which curlx_convert_UTF8_to_wchar() (which is effectively MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...)) would succeed but curlx_is_str_utf8() would fail, or vice versa? What is the benefit of pre-validating the UTF-8 string with a custom local function before passing it to Windows to convert it to UTF-16? Even if these two disagreed, why not just rely on Windows here?

@jay (Member, Author) commented Jul 19, 2021

What is the dynamic fallback? I may have missed that.

The reason I wrote a separate function to check UTF-8 is that it's more accurate to have a function that returns an error only on bad UTF-8. Using MultiByteToWideChar to check could fail for other reasons, though I'd guess that's pretty unlikely.
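For reference, a minimal portable sketch of such a strict check. The name is_utf8 and the exact rejection rules here are illustrative, not necessarily what the patch's curlx_is_str_utf8() implements:

```c
#include <stdbool.h>

/* Return true only if 's' is well-formed UTF-8: rejects stray
   continuation bytes, overlong encodings, UTF-16 surrogates and
   code points above U+10FFFF. */
static bool is_utf8(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  while(*p) {
    unsigned char c = *p;
    int len, i;
    if(c < 0x80)
      len = 1;                                /* ASCII */
    else if((c & 0xE0) == 0xC0 && c >= 0xC2)
      len = 2;                                /* 0xC0/0xC1 are overlong */
    else if((c & 0xF0) == 0xE0)
      len = 3;
    else if((c & 0xF8) == 0xF0 && c <= 0xF4)
      len = 4;                                /* above 0xF4 exceeds U+10FFFF */
    else
      return false;                           /* e.g. a stray continuation byte */
    for(i = 1; i < len; i++)
      if((p[i] & 0xC0) != 0x80)
        return false;                         /* missing continuation byte */
    if(len == 3 && ((c == 0xE0 && p[1] < 0xA0) ||   /* overlong 3-byte form */
                    (c == 0xED && p[1] >= 0xA0)))   /* UTF-16 surrogate */
      return false;
    if(len == 4 && ((c == 0xF0 && p[1] < 0x90) ||   /* overlong 4-byte form */
                    (c == 0xF4 && p[1] >= 0x90)))   /* above U+10FFFF */
      return false;
    p += len;
  }
  return true;
}
```

Unlike a MultiByteToWideChar round trip, a check like this can only fail for one reason: the input is not valid UTF-8.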

@vszakats (Member)

By "dynamic fallback" I meant the original (now removed in master) logic that would retry file operations with a different locale if they failed.

The Windows API docs say that MultiByteToWideChar() returns ERROR_NO_UNICODE_TRANSLATION for invalid input, so catching this seems to cover it. None of the other error codes (ERROR_INSUFFICIENT_BUFFER, ERROR_INVALID_FLAGS, ERROR_INVALID_PARAMETER) should ever happen in the second/final call. (Unless the Windows environment lacks UTF-8 support, but I don't believe that can happen for a binary built against the UNICODE Windows API.)

@jay (Member, Author) commented Jul 19, 2021

By "dynamic fallback" I meant the original (now removed in master) logic that would retry file operations with a different locale if they failed.

It only falls back now if the string is not valid UTF-8. The original behavior of retrying valid UTF-8 in the current locale could lead to libcurl reading or writing an unintended file, so this is not a full revert. Conversely, it is also true that a string in the current locale could be valid UTF-8, and that could lead to reading or writing an unintended file. As discussed elsewhere, the documentation will need to be improved to address this.

#ifdef _UNICODE
  if(curlx_is_str_utf8(filename)) {
    /* convert the UTF-8 string to UTF-16 and use the wide-character call */
    wchar_t *filename_w = curlx_convert_UTF8_to_wchar(filename);
    ...
    return _wfopen(filename_w, mode_w);
  }
  else
#endif
    return fopen(filename, mode);

@vszakats (Member) commented Jul 19, 2021

Oops, so my fear was right. I think this is a mistake: libraries should not try to detect and decide which codepage the caller means. As I wrote elsewhere, invalid UTF-8 may still have been intended as UTF-8 and thus should fail when used; also, valid UTF-8 may well be a valid 8-bit string as well. Meaning it is not possible to auto-detect UTF-8 vs. 8-bit codepages, and even less so the original intent of the caller.
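To make the ambiguity concrete: the byte pair 0xC3 0xA9 is a complete, valid UTF-8 sequence for U+00E9 ('é'), while in Windows-1252 the very same bytes spell the two characters 'Ã' and '©'. A sketch (the hand-written decoders below are illustrative; in the 0xA0 to 0xFF range, Windows-1252 code points happen to equal the byte values):

```c
/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a
   Unicode code point. Assumes the sequence is already validated. */
static unsigned decode_utf8_2(const unsigned char *b)
{
  return ((b[0] & 0x1Fu) << 6) | (b[1] & 0x3Fu);
}

/* Decode one Windows-1252 byte in the 0xA0-0xFF range; there the
   codepage maps each byte to the code point of the same value. */
static unsigned decode_cp1252_high(unsigned char b)
{
  return (unsigned)b;   /* valid only for 0xA0 <= b <= 0xFF */
}
```

Both decodings are perfectly valid, so no validity check can recover which encoding the caller intended.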

I'm very strongly against committing such a change. It takes away control from the user, and the good intention will result in unexpected behaviour or, in the worst case, a footgun.

If the goal is to maintain support for both UTF-8 and 8-bit codepages using the exact same API, the transparent and correct way of dealing with this is to let the user specify whether "ANSI"/UNICODE should be passed/returned for non-ASCII libcurl strings. In UNICODE builds this can have a default that makes calls backwards compatible (accepting/returning 8-bit "ANSI"), while allowing UTF-8 to be passed/returned when explicitly enabled.

Though I'd argue that compatibility has already been broken a while ago when UNICODE builds became possible without paying attention to this, so making the default encoding UTF-8 in these builds is probably the cleaner solution.

@jay (Member, Author) commented Jul 19, 2021

As I wrote elsewhere, invalid UTF-8 may still have been intended as UTF-8 and thus should fail when used; also, valid UTF-8 may well be a valid 8-bit string as well.

It's highly unlikely that invalid UTF-8 is intended as UTF-8, and highly likely that it's actually in the current locale, given that it isn't documented in the options (and which options?) that UTF-8 is used for some strings (and which strings?) in Windows Unicode builds. There is too much I don't know about this, and I think not having this fallback could break things (and which things?) for users passing non-ASCII strings in those builds, which I again would assume are in the current locale.

@vszakats (Member) commented Jul 19, 2021

I believe it is especially dangerous to introduce such a feature when we don't understand the full extent of its consequences.

In the end it will make the system behave unexpectedly in every scenario. It's not a good result if libcurl misdetects a filename as UTF-8, tries to open it and fails, leaving libcurl without a CA store, pin or public/private key (or a dozen other similar security-related external files). While this may be unlikely (how unlikely? we don't know), it can be even worse: the user can end up with a different CA store/key/pin, making this exploitable. (Speaking of non-ASCII URLs, the situation is similarly bad, but maybe easier to grasp: who would want a libcurl heuristic to decide which URL to hit over the network?)

Given the undocumented past of string encoding/codepage requirements and the lack of many (or any) bug reports about this, I just don't see what such an auto-detection effort is trying to resolve, exactly. What is easier to see is that the proposed solution is heuristic and risky.

Since this area is/was undocumented, I believe the correct path is bringing it to a well-defined shape and document it. And in a well-defined shape the best is to leave out auto-detection and similar fallback heuristics.

@jay (Member, Author) commented Jul 19, 2021

I disagree. We already had this behavior. It's not ideal, but not putting it in the upcoming release is sure to break some things. Which things, I don't know, but unless someone else is in support of this we're going to find out.

@vszakats (Member)

I think the best solution will be to stop creating UNICODE builds on Windows. They were broken from the start and with changes like this one, the situation will be even worse.

vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Unicode support in libcurl does not
seem to be ready for production. Existing support extended certain Windows
interfaces to use the Unicode flavour of the Windows API, but that also
meant that the expected encoding/codepage of strings (e.g. local filenames,
URLs) exchanged via the libcurl API became ambiguous and undefined.
Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API/option, certain dynamic and static "fallback" logic inside
libcurl and even in OpenSSL, while some parts of libcurl kept using 8-bit
strings internally. From the user's perspective this poses an unreasonably
difficult task in finding out how to pass a certain non-ASCII string to a
specific API without unwanted or accidental (possibly lossy) conversions or
other side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, finding different files,
accessing the wrong URL or passing a corrupt username or password.

Note that these issues may _only_ affect strings with _non-ASCII_ content.

For now the best solution seems to be to revert back to how libcurl/curl
worked for most of its existence and only re-enable Unicode once the
remaining parts of Windows Unicode support are well-understood, ironed out
and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
vszakats added a commit to curl/curl-for-win that referenced this pull request Jul 20, 2021
On closer inspection, the state of Windows Unicode support in libcurl does
not seem to be ready for production. Existing support extended certain
Windows interfaces to use the Unicode flavour of the Windows API, but that
also meant that the expected encoding/codepage of strings (e.g. local
filenames, URLs) exchanged via the libcurl API became ambiguous and
undefined.

Previously all strings had to be passed in the active Windows locale, using
an 8-bit codepage. In Unicode libcurl builds, the expected string encoding
became an undocumented mixture of UTF-8 and 8-bit locale, depending on the
actual API, build options/dependencies, internal fallback logic based on
runtime auto-detection of the passed string, and the result of file
operations (scheduled for removal in 7.78.0), while some parts of libcurl
kept using 8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in
finding out how to pass (or read) a certain non-ASCII string to (from) a
specific API without unwanted or accidental conversions or other
side-effects. Missing the correct encoding may result in unexpected
behaviour, e.g. in some cases not finding files, reading/writing a
different file, accessing the wrong URL or passing a corrupt username or
password.

Note that these issues may only affect strings with _non-7-bit-ASCII_
content.

For now the least bad solution seems to be to revert back to how
libcurl/curl worked for most of its existence and only re-enable Unicode
once the remaining parts of Windows Unicode support are well-understood,
ironed out and documented.

Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully
this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
@jay jay added the on-hold label Jul 20, 2021
@jay (Member, Author) commented Sep 24, 2021

Closing because there have been no problem reports since 7.78 was released without the fallbacks.

@jay jay closed this Sep 24, 2021
@vszakats vszakats added the Unicode Unicode, code page, character encoding label Feb 16, 2023