curl_multibyte: Resurrect local encoding fallbacks #7421
jay wants to merge 1 commit into curl:master
Conversation
This is an undo of 765e060, which removed local encoding fallbacks in Windows Unicode builds.

Briefly, Windows Unicode builds are expected to use UTF-8 for file paths internally, however the user may pass paths in the current locale. Though that is not expected for paths that are opened internally, it may be necessary for paths passed to dependencies that expect current locale encoding. It is not documented which paths are expected to be passed in UTF-8 encoding in Windows Unicode builds.

Prior to 765e060 the behavior was to convert the passed-in string from UTF-8 to UTF-16 and then pass that to the respective wide character file function (eg _wfopen). If the conversion or the file function failed, then as a fallback it would try the passed-in string with the regular file function (eg fopen).

In comparison, the behavior in this change is to check explicitly whether the string is valid UTF-8 before the conversion, and only if it is not to fall back on the respective file function that uses the current locale.

The main difference is that the file function failing no longer causes a fallback to occur. That removes what could be considered a slight security or stability issue, since the file function could fail for any reason, such as the file not existing.

Comments I made in 765e060 describing a security issue are incorrect. Regardless of any fallback it is possible for the user to pass a string in the current locale that is also valid UTF-8. The security issue IMO is that it is not documented or well known which strings should be in which encoding, except that as developers we know UTF-8 is used for file paths internally (ie files opened by curl/libcurl) in Windows Unicode builds.

Closes #xxxx
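The decision flow described above (decide the encoding up front, then call exactly one file function) can be sketched roughly like this. This is illustrative only: `open_file_with_fallback` and `looks_like_utf8` are hypothetical names, not curl's actual internals, and the UTF-8 scan is reduced to a structural check.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Reduced structural UTF-8 scan for illustration: checks lead/continuation
   byte patterns only. A real check must also reject overlong forms,
   UTF-16 surrogates and code points above U+10FFFF. */
static int looks_like_utf8(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  while(*p) {
    int extra;
    if(*p < 0x80)
      extra = 0;                      /* plain ASCII byte */
    else if((*p & 0xE0) == 0xC0)
      extra = 1;                      /* 2-byte sequence lead */
    else if((*p & 0xF0) == 0xE0)
      extra = 2;                      /* 3-byte sequence lead */
    else if((*p & 0xF8) == 0xF0)
      extra = 3;                      /* 4-byte sequence lead */
    else
      return 0;                       /* stray continuation or invalid byte */
    p++;
    while(extra--) {
      if((*p & 0xC0) != 0x80)
        return 0;                     /* truncated or bad continuation */
      p++;
    }
  }
  return 1;
}

/* Post-change behaviour: decide the encoding up front, then call exactly
   one file function. A failing file function no longer triggers a second
   attempt in the other encoding. */
static FILE *open_file_with_fallback(const char *path, const char *mode)
{
  if(!looks_like_utf8(path))
    return fopen(path, mode);         /* fallback: assume current locale */
#ifdef _WIN32
  {
    wchar_t wpath[MAX_PATH], wmode[16];
    if(MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1,
                           wpath, MAX_PATH) &&
       MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1,
                           wmode, 16))
      return _wfopen(wpath, wmode);
    return NULL;        /* valid UTF-8 but conversion failed: no retry */
  }
#else
  return fopen(path, mode);           /* non-Windows: narrow call is fine */
#endif
}
```

The pre-765e060 behaviour differed in the last step: if the wide function itself failed, the same path was retried with the narrow function in the current locale.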
Reading into the details, I'm confused. The patch does not restore the dynamic 8-bit fallback, which I agree with. (Coming from the comment, I first assumed it'd be restored.) But, what is the scenario when
What is the dynamic fallback? I may have missed that. The reason I wrote a separate function to check UTF-8 is that it's more accurate to have a function that returns an error only on bad UTF-8. Using MultiByteToWideChar to check could fail for other reasons, though I'd guess that's pretty unlikely.
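For illustration, a standalone check that fails only on malformed UTF-8 (rather than MultiByteToWideChar, which can fail for other reasons) might look like the following sketch; `is_well_formed_utf8` is a hypothetical name, not the function from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the NUL-terminated string is well-formed UTF-8 per RFC 3629,
   else 0. Rejects overlong forms, UTF-16 surrogates and code points above
   U+10FFFF. Never reads past the terminating NUL. */
static int is_well_formed_utf8(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  while(*p) {
    unsigned char b = *p;
    size_t len;
    unsigned long cp;
    if(b < 0x80) {                    /* plain ASCII byte */
      p++;
      continue;
    }
    else if((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else
      return 0;                       /* stray continuation or invalid lead */
    for(size_t i = 1; i < len; i++) {
      if((p[i] & 0xC0) != 0x80)
        return 0;                     /* truncated or bad continuation byte */
      cp = (cp << 6) | (p[i] & 0x3F);
    }
    if((len == 2 && cp < 0x80) ||
       (len == 3 && cp < 0x800) ||
       (len == 4 && cp < 0x10000))
      return 0;                       /* overlong encoding */
    if(cp >= 0xD800 && cp <= 0xDFFF)
      return 0;                       /* UTF-16 surrogate half */
    if(cp > 0x10FFFF)
      return 0;                       /* beyond the Unicode range */
    p += len;
  }
  return 1;
}
```

Note the contrast with MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...): that call also fails on, e.g., an undersized output buffer, so its failure is not a reliable "this string is not UTF-8" signal.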
By "dynamic fallback" I meant the original (now removed in Windows API docs says that

It only falls back now if the string is not valid UTF-8. The original behavior of retrying valid UTF-8 in the current locale could lead to libcurl reading or writing an unintended file, so this is not a full revert. Conversely, it is also true that a string in the current locale could be valid UTF-8, and that could lead to reading or writing an unintended file. As discussed elsewhere, the doc will need to be improved to address this `#ifdef _UNICODE` behavior.
Oops, so my fear was right. I think this is a mistake: libs should not try to detect and decide what codepage the caller means. As I wrote elsewhere, an invalid UTF-8 string may still have been intended as UTF-8 and thus should fail when used; also, valid UTF-8 may well be a valid 8-bit string as well. Meaning it is not possible to auto-detect UTF-8 vs 8-bit codepages, and even less so the original intent of the caller.

I'm very strongly against committing such a change. It takes away control from the user, and the good intention will result in unexpected behaviour or, in the worst case, a footgun.

If the goal is to maintain support for both UTF-8 and 8-bit codepages using the exact same API, the transparent and correct way of dealing with this is to let the user specify whether "ANSI" or UNICODE should be passed/returned for non-ASCII libcurl strings. In UNICODE builds this can have a default that keeps calls backwards compatible (accepting/returning 8-bit "ANSI"), while allowing UTF-8 to be passed/returned when explicitly enabled. Though I'd argue that compatibility was already broken a while ago when UNICODE builds became possible without paying attention to this, so making the default encoding UTF-8 in these builds is probably the cleaner solution.

It's highly unlikely that invalid UTF-8 is intended as UTF-8, and highly likely that it's actually in the current locale, given that it isn't documented in the options (and which options?) that UTF-8 is used for some strings (and which strings?) in Windows Unicode builds. There is too much I don't know about this, and I think not having this fallback could break things (and which things?) for users passing non-ASCII strings in those builds, which I again would assume are in the current locale.
I believe it is especially dangerous to introduce such a feature if we don't understand the full extent of its consequences. In the end it will make the system behave unexpectedly in every scenario. It's not a good result if libcurl misdetects a filename as UTF-8, tries to open it and fails, leaving libcurl without a CA store, pin or public/private key (or another dozen similar security-related external files). While this may be unlikely (how unlikely? we don't know), it can be even worse: the user can end up with a different CA store/key/pin, making this exploitable. (Speaking of non-ASCII URLs, the situation is similarly bad, but maybe easier to grasp: who would want a libcurl heuristic deciding which URL to hit over the network?)

Given the undocumented past of string encoding/codepage requirements and the lack of much/any bug reports regarding this, I just don't see what such an auto-detection effort is trying to resolve exactly. What is easier to see is that the proposed solution is heuristic and risky. Since this area is/was undocumented, I believe the correct path is bringing it to a well-defined shape and documenting it. And in a well-defined shape it is best to leave out auto-detection and similar fallback heuristics.
I disagree. We already had this behavior. It's not ideal, but not putting it in the upcoming release is sure to break some things. Which things, I don't know, but unless someone else is in support of this we're going to find out.
I think the best solution will be to stop creating UNICODE builds on Windows. They were broken from the start and with changes like this one, the situation will be even worse. |
On closer inspection, the state of Windows Unicode support in libcurl does not seem to be ready for production. Existing support extended certain Windows interfaces to use the Unicode flavour of the Windows API, but that also meant that the expected encoding/codepage of strings (e.g. local filenames, URLs) exchanged via the libcurl API became ambiguous and undefined.

Previously all strings had to be passed in the active Windows locale, using an 8-bit codepage. In Unicode libcurl builds, the expected string encoding became an undocumented mixture of UTF-8 and 8-bit locale, depending on the actual API, build options/dependencies, internal fallback logic based on runtime auto-detection of the passed string, and the result of file operations (scheduled for removal in 7.78.0), while some parts of libcurl kept using 8-bit strings internally, e.g. when reading the environment.

From the user's perspective this poses an unreasonably complex task in finding out how to pass (or read) a certain non-ASCII string to (from) a specific API without unwanted or accidental conversions or other side-effects. Missing the correct encoding may result in unexpected behaviour, e.g. in some cases not finding files, reading/writing a different file, accessing the wrong URL or passing a corrupt username or password. Note that these issues may only affect strings with _non-7-bit-ASCII_ content.

For now the least bad solution seems to be to revert back to how libcurl/curl worked for most of its existence and only re-enable Unicode once the remaining parts of Windows Unicode support are well-understood, ironed out and documented. Unicode was enabled in curl-for-win about a year ago with 7.71.0. Hopefully this period has had the benefit of surfacing some of these issues.

Ref: curl/curl#6089
Ref: curl/curl#7246
Ref: curl/curl#7251
Ref: curl/curl#7252
Ref: curl/curl#7257
Ref: curl/curl#7281
Ref: curl/curl#7421
Ref: https://github.com/curl/curl/wiki/libcurl-and-expected-string-encodings
Ref: 8023ee5
Closing because there have been no problem reports since 7.78 was released without the fallbacks.