<format> Properly handle multibyte encodings in format strings#1815
<format> Properly handle multibyte encodings in format strings#1815CaseyCarter merged 10 commits intomicrosoft:feature/formatfrom
Conversation
|
|
||
| template <class charT, class integral> | ||
| void test_intergal_specs() { | ||
| void test_integral_specs() { |
There was a problem hiding this comment.
Intergal specs: 💃👓💃. (No change requested, just thought this was a funny typo.)
miscco
left a comment
There was a problem hiding this comment.
Now that I have reviewed it, how can I make all this encoding horror unseen?
| auto _UResult = _RANGES _Find_unchecked( | ||
| _Get_unwrapped(_STD move(_First)), _Get_unwrapped(_STD move(_Last)), _Val, _Pass_fn(_Proj)); |
There was a problem hiding this comment.
I do not get why we are pulling those algorithms out. We do not really want the bounds checks and we already have the unchecked iterators ready.
Why not simply call _RANGES _Find_unchecked and _RANGES _Copy_unchecked ?
There was a problem hiding this comment.
For calling directly with range arguments when we don't have iterators handy, it's much simpler to call the range algorithm. (Recall that Range algorithms with range arguments don't perform bounds checks - they trust that begin() and end() return a valid range.)
| [[fallthrough]]; | ||
| case 1: | ||
| return 1; // all characters have only one code unit | ||
|
|
There was a problem hiding this comment.
What is it with these newlines?
There was a problem hiding this comment.
In a switch with cases that fall through, I like to use newlines after only the cases that do not fall through as a visual reinforcement of the control flow structure.
|
|
||
| // TRANSITION, add support for unicode/wide formats | ||
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; |
There was a problem hiding this comment.
I belive this is broken
- We should use
_RANGES _Unchecked_find - If we have wide characters this is only one element, so we should not search
- If we have plain characters you just changed
_Fillto only contain _CharType() so you will always findend(_Specs._Fill)
There was a problem hiding this comment.
we might want to just store the length in character types of the fill character array
There was a problem hiding this comment.
- I'm too lazy to call
beginandendon a range when the range algorithm can do it for me. - Wide characters are UTF-16 on our implementation, so two may be needed to represent a fill character (the test case that uses 🏈, for example).
meow woof[4] = {quack};is equivalent tomeow woof[4] = {quack, {}, {}, {}};, not tomeow woof[4] = {quack, quack, quack, quack};.
we might want to just store the length in character types of the fill character array
I'm largely indifferent; it's just another char in _Basic_format_specs. I'll happily make the change if you think it's warranted; please say so more definitively than "we might."
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; | ||
| for (; _Fill_left > 0; --_Fill_left) { | ||
| _Out = _RANGES copy(_Fill_char, _STD move(_Out)).out; |
There was a problem hiding this comment.
Ditto _RANGES _Unchecked_copy
There was a problem hiding this comment.
Ditto "I see no benefit commensurate with the additional code". Do you?
|
|
||
| // TRANSITION, add support for unicode/wide formats | ||
| _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]); | ||
| const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')}; |
There was a problem hiding this comment.
we might want to just store the length in character types of the fill character array
| } | ||
| } | ||
|
|
||
| case 4: // Assume UTF-8 (as does _Mbrtowc) |
There was a problem hiding this comment.
Are non-UTF-8 4-byte encodings (e.g. GB18030/code page 54936) possible here?
There was a problem hiding this comment.
This function (which is effectively a partially-inlined call to _Mbrtowc, which implements the equivalent for std::codecvt) and _Mbrtowc certainly don't think so. I'd like to do better, but I'm not prepared to overhaul how our locale machinery deals with character encodings in this PR.
Background: _Mbrtowc is a wrapper around a call to MultiByteToWideChar which relies on the anachronistic assumption that any encoded character can be converted to a single UCS-2 code unit. This is clearly wrong and has been for many years, but it's the mechanism we have right now for determining the extent of multibyte characters. I think we can do better, especially now that ICU is available so we may not need to write and maintain enormous quantities of per-supported-encoding code.
stl/inc/format
Outdated
| const auto _Len = static_cast<size_t>(_Last - _First); | ||
|
|
||
| if (_Ch < 0b1110'0000) { | ||
| if (_Ch >= 0b1100'0000 && _Len >= 2) { | ||
| return 2; | ||
| } | ||
|
|
||
| // non-lead byte or partial code unit | ||
| return -1; | ||
| } | ||
|
|
||
| if (_Ch < 0b1111'0000) { | ||
| if (_Len >= 3) { | ||
| return 3; | ||
| } | ||
|
|
||
| // partial code unit | ||
| return -1; | ||
| } | ||
|
|
||
| if (_Len >= 4) { | ||
| return 4; | ||
| } | ||
|
|
||
| // partial code unit | ||
| return -1; |
There was a problem hiding this comment.
This doesn't properly nor consistently handle all invalid UTF-8. We should either treat the partial/invalid code units as a "�" (details omitted) upon all invalid UTF-8 and continue parsing (this is the preferred default error handling method for decoding regular UTF-8 text), or throw format_error upon all invalid UTF-8.
Test cases:
| Next bytes | Valid UTF-8? | Non-throwing | Throwing | Current implementation |
|---|---|---|---|---|
"\xc2\xa2" |
Yes | 2 "¢" |
2 | 2 |
"\xc0\x80" |
No | 1 "��" |
-1 | 2 |
"\xc0\x7b" |
No | 1 "�{" |
-1 | 2 (bad!) |
There was a problem hiding this comment.
I've made no attempt to validate the encoding of the format string - we don't examine continuation bytes at all. I've only done the minimal amount of work needed to avoid running off the end of the format string. This seems reasonable for code that parses format strings, which are generally trusted.
…::operator*() error LNK2019: unresolved external symbol".
|
I've pushed two commits which should resolve the test failures:
|
|
x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64. |
... and simply saturate to `INT_MAX`.
| ptrdiff_t _Width = _Specs._Precision; | ||
| const _CharT* _Last = _Measure_string_prefix(_Value, _Width); | ||
|
|
||
| return _Write_aligned(_STD move(_Out), _Width, _Specs, _Align::_Left, [=](_OutputIt _Out) { |
There was a problem hiding this comment.
Weirdness: we don't do width estimation for fill characters, the spec simply assumes they have width 1. Consequently, format("{:3.2}", "日本地図") yields "日 ", whereas format("{:本<3.2}", "日本地図") yields "日本". This seems like a standard defect.
This was due to my lazy attempt to avoid overflow checking by using |
stl/inc/format
Outdated
|
|
||
| const auto _Len = static_cast<size_t>(_Last - _First); | ||
|
|
||
| if (_Ch < 0b1110'0000) { |
There was a problem hiding this comment.
possible future performance optimization would be branch prediction annotations.
There was a problem hiding this comment.
I thought about this, but decided to avoid them. The vague "arbitrarily more/less likely" wording in the Standard means you don't know if a compiler will reorder branches a bit, or move unlikely branches off into the wilderness with cold exception-throwing code. Something like the latter can be catastrophic to performance when you really are just making a guess about the distribution of inputs. I'm happy with ordering the branches by what we feel is decreasing frequency and leaving anything further to folks running PGO with (hopefully) representative inputs.
barcharcraz
left a comment
There was a problem hiding this comment.
Looks good after those minor nodiscard changes.
Adds support for multibyte fill characters and width estimation for strings.
Pushes
_RANGES copyand_RANGES fillup from<algorithm>into<xutility>. (Should be a pure cut'n'paste.)Fixes #1576.
Draft for now because I need some more test coverage for width estimation with and without alignment.