<format> Properly handle multibyte encodings in format strings by CaseyCarter · Pull Request #1815 · microsoft/STL

CaseyCarter · 2021-04-09T03:34:41Z

Adds support for multibyte fill characters and width estimation for strings.

Pushes _RANGES copy and _RANGES fill up from <algorithm> into <xutility>. (Should be a pure cut'n'paste.)

Fixes #1576.

Draft for now because I need some more test coverage for width estimation with and without alignment.

stl/inc/format

CaseyCarter · 2021-04-09T03:53:40Z

tests/std/tests/P0645R10_text_formatting_formatting/test.cpp


 template <class charT, class integral>
-void test_intergal_specs() {
+void test_integral_specs() {


Intergal specs: 💃👓💃. (No change requested, just thought this was a funny typo.)

miscco

Now that I have reviewed it, how can I make all this encoding horror unseen?

miscco · 2021-04-09T07:43:02Z

stl/inc/algorithm

-            auto _UResult = _RANGES _Find_unchecked(
-                _Get_unwrapped(_STD move(_First)), _Get_unwrapped(_STD move(_Last)), _Val, _Pass_fn(_Proj));


I do not get why we are pulling those algorithms out. We do not really want the bounds checks and we already have the unchecked iterators ready.

Why not simply call _RANGES _Find_unchecked and _RANGES _Copy_unchecked ?

For calling directly with range arguments when we don't have iterators handy, it's much simpler to call the range algorithm. (Recall that range algorithms with range arguments don't perform bounds checks - they trust that begin() and end() return a valid range.)

miscco · 2021-04-09T07:44:13Z

stl/inc/format

+        [[fallthrough]];
+    case 1:
+        return 1; // all characters have only one code unit
+


What is it with these newlines?

In a switch with cases that fall through, I like to use newlines after only the cases that do not fall through as a visual reinforcement of the control flow structure.

stl/inc/format

miscco · 2021-04-09T08:03:32Z

stl/inc/format


-    // TRANSITION, add support for unicode/wide formats
-    _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
+    const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};


I believe this is broken.

We should use _RANGES _Unchecked_find.

If we have wide characters, this is only one element, so we should not search.

If we have plain characters, you just changed _Fill to only contain _CharType( ), so you will always find end(_Specs._Fill).

we might want to just store the length in character types of the fill character array

I'm too lazy to call begin and end on a range when the range algorithm can do it for me.

Wide characters are UTF-16 on our implementation, so two may be needed to represent a fill character (the test case that uses 🏈, for example).

meow woof[4] = {quack}; is equivalent to meow woof[4] = {quack, {}, {}, {}};, not to meow woof[4] = {quack, quack, quack, quack};.

we might want to just store the length in character types of the fill character array

I'm largely indifferent; it's just another char in _Basic_format_specs. I'll happily make the change if you think it's warranted; please say so more definitively than "we might."
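The aggregate-initialization point above is what makes the search for `'\0'` work: every slot after the explicit initializers is value-initialized to zero. A minimal sketch, assuming a hypothetical `fill_spec` struct as a stand-in for the real specs type (only the `_Fill` member name comes from the diff) and using classic `std::find` rather than `_RANGES find` so it compiles before C++20:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical stand-in for the fill buffer held in the format specs.
struct fill_spec {
    char fill[4] = {' '}; // same as {' ', '\0', '\0', '\0'}, NOT four spaces
};

// Number of code units in the stored fill character: the distance to the
// first '\0', or the whole array when all four slots are occupied.
inline std::size_t fill_length(const char (&fill)[4]) {
    const char* const first = std::begin(fill);
    const char* const last  = std::end(fill);
    return static_cast<std::size_t>(std::find(first, last, '\0') - first);
}
```

When all four code units are in use, the search returns the end of the array, so the length is still correct without a stored terminator.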

miscco · 2021-04-09T08:07:20Z

stl/inc/format

-    _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
+    const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};
+    for (; _Fill_left > 0; --_Fill_left) {
+        _Out = _RANGES copy(_Fill_char, _STD move(_Out)).out;


Ditto _RANGES _Unchecked_copy

Ditto "I see no benefit commensurate with the additional code". Do you?
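The loop in the diff replaces a single `fill_n` of one code unit with repeated copies of the whole fill sequence. A rough sketch of that idea with standard algorithms follows; `write_fill` and `pad_left` are hypothetical names for illustration, not functions from the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>

// Write `count` copies of a fill character that may span several code units.
template <class OutputIt>
OutputIt write_fill(OutputIt out, std::string_view fill_char, int count) {
    for (; count > 0; --count) {
        out = std::copy(fill_char.begin(), fill_char.end(), out);
    }
    return out;
}

// Prefix `text` with `count` fill characters (a right-aligned field).
inline std::string pad_left(std::string_view text, std::string_view fill_char, int count) {
    std::string result;
    write_fill(std::back_inserter(result), fill_char, count);
    result.append(text.data(), text.size());
    return result;
}
```

Copying one code unit per fill position, as the old `fill_n` call did, would emit only the first byte of a multibyte fill character and produce ill-formed output.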

stl/inc/format

barcharcraz · 2021-04-09T09:27:16Z

stl/inc/format


-    // TRANSITION, add support for unicode/wide formats
-    _Out = _RANGES fill_n(_STD move(_Out), _Fill_left, _Specs._Fill[0]);
+    const basic_string_view<_CharT> _Fill_char{_Specs._Fill, _RANGES find(_Specs._Fill, '\0')};


we might want to just store the length in character types of the fill character array

statementreply · 2021-04-09T10:02:49Z

stl/inc/format

+            }
+        }
+
+    case 4: // Assume UTF-8 (as does _Mbrtowc)


Are non-UTF-8 4-byte encodings (e.g. GB18030/code page 54936) possible here?

This function (which is effectively a partially-inlined call to _Mbrtowc, which implements the equivalent for std::codecvt) and _Mbrtowc certainly don't think so. I'd like to do better, but I'm not prepared to overhaul how our locale machinery deals with character encodings in this PR.

Background: _Mbrtowc is a wrapper around a call to MultiByteToWideChar which relies on the anachronistic assumption that any encoded character can be converted to a single UCS-2 code unit. This is clearly wrong and has been for many years, but it's the mechanism we have right now for determining the extent of multibyte characters. I think we can do better, especially now that ICU is available so we may not need to write and maintain enormous quantities of per-supported-encoding code.
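To make the UCS-2 limitation concrete: any code point above U+FFFF needs two UTF-16 code units, which is also why a wide fill character may occupy two array slots. The sketch below uses hypothetical helper names and is only an illustration of the encoding arithmetic, not code from the STL:

```cpp
#include <cassert>
#include <cstddef>

// UTF-16 code units needed for a Unicode scalar value.
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

struct surrogate_pair {
    char16_t high;
    char16_t low;
};

// Split a supplementary-plane code point (U+10000..U+10FFFF) into surrogates.
constexpr surrogate_pair to_surrogates(char32_t cp) {
    return {static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}
```

For U+1F3C8 (🏈), this yields the pair 0xD83C 0xDFC8, which cannot be represented as a single UCS-2 code unit.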

statementreply · 2021-04-09T10:24:21Z

stl/inc/format

+            const auto _Len = static_cast<size_t>(_Last - _First);
+
+            if (_Ch < 0b1110'0000) {
+                if (_Ch >= 0b1100'0000 && _Len >= 2) {
+                    return 2;
+                }
+
+                // non-lead byte or partial code unit
+                return -1;
+            }
+
+            if (_Ch < 0b1111'0000) {
+                if (_Len >= 3) {
+                    return 3;
+                }
+
+                // partial code unit
+                return -1;
+            }
+
+            if (_Len >= 4) {
+                return 4;
+            }
+
+            // partial code unit
+            return -1;


This handles invalid UTF-8 neither properly nor consistently. We should either treat each partial/invalid sequence as a "�" (U+FFFD; details omitted) and continue parsing (this is the preferred default error handling method for decoding regular UTF-8 text), or throw format_error upon all invalid UTF-8.

Test cases:

| Next bytes | Valid UTF-8? | Non-throwing | Throwing | Current implementation |
| --- | --- | --- | --- | --- |
| `"\xc2\xa2"` | Yes | 2 `"¢"` | 2 | 2 |
| `"\xc0\x80"` | No | 1 `"��"` | -1 | 2 |
| `"\xc0\x7b"` | No | 1 `"�{"` | -1 | 2 (bad!) |

I've made no attempt to validate the encoding of the format string - we don't examine continuation bytes at all. I've only done the minimal amount of work needed to avoid running off the end of the format string. This seems reasonable for code that parses format strings, which are generally trusted.
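The two positions in this thread can be compared side by side. The sketch below is an approximation under stated assumptions: `lead_byte_units` mirrors the quoted hunk (the ASCII branch is assumed, since the hunk starts after it), and `strict_units` is a hypothetical stricter variant that is not part of the PR; it returns -1 where the reviewer's "Throwing" column does.

```cpp
#include <cassert>
#include <cstddef>

// Minimal check modeled on the quoted diff: classify by lead byte and make
// sure the sequence does not run past `last`. Continuation bytes are never
// examined. Precondition: first != last.
inline int lead_byte_units(const char* first, const char* last) {
    const auto ch = static_cast<unsigned char>(*first);
    if (ch < 0b1000'0000) {
        return 1; // ASCII (assumed; not shown in the quoted hunk)
    }

    const auto len = static_cast<std::size_t>(last - first);
    if (ch < 0b1110'0000) {
        return (ch >= 0b1100'0000 && len >= 2) ? 2 : -1; // -1: non-lead byte or partial
    }
    if (ch < 0b1111'0000) {
        return len >= 3 ? 3 : -1;
    }
    return len >= 4 ? 4 : -1;
}

// Hypothetical stricter variant: also rejects bad continuation bytes,
// overlong forms, encoded surrogates, and values above U+10FFFF.
inline int strict_units(const char* first, const char* last) {
    const int n = lead_byte_units(first, last);
    if (n <= 1) {
        return n;
    }
    const auto b0 = static_cast<unsigned char>(first[0]);
    const auto b1 = static_cast<unsigned char>(first[1]);
    if (b0 < 0xC2 || b0 > 0xF4) {
        return -1; // C0/C1 are always overlong; F5..FF never occur in UTF-8
    }
    for (int i = 1; i < n; ++i) {
        if ((static_cast<unsigned char>(first[i]) & 0xC0) != 0x80) {
            return -1; // not a continuation byte
        }
    }
    if ((b0 == 0xE0 && b1 < 0xA0) || (b0 == 0xF0 && b1 < 0x90)) {
        return -1; // overlong three- or four-byte form
    }
    if ((b0 == 0xED && b1 >= 0xA0) || (b0 == 0xF4 && b1 >= 0x90)) {
        return -1; // encoded surrogate, or beyond U+10FFFF
    }
    return n;
}
```

The minimal version only guarantees that parsing never reads past the end of the format string, which is the property the reply above says it was written for.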

StephanTLavavej · 2021-04-09T11:57:42Z

I've pushed two commits which should resolve the test failures:

- Remove constexpr from P0355R7_calendars_and_time_zones_formatting.
- Workaround VSO-1308657 "Standard Library Header Units: std::projected::operator*() error LNK2019: unresolved external symbol".
  - The workaround is to add a definition that just calls abort().

StephanTLavavej · 2021-04-09T12:21:36Z

x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64.

CaseyCarter · 2021-04-09T20:15:05Z

stl/inc/format

+    ptrdiff_t _Width    = _Specs._Precision;
+    const _CharT* _Last = _Measure_string_prefix(_Value, _Width);
+
+    return _Write_aligned(_STD move(_Out), _Width, _Specs, _Align::_Left, [=](_OutputIt _Out) {


Weirdness: we don't do width estimation for fill characters, the spec simply assumes they have width 1. Consequently, format("{:3.2}", "日本地図") yields "日 ", whereas format("{:本<3.2}", "日本地図") yields "日本". This seems like a standard defect.
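The asymmetry is easier to see in a toy model. The sketch below works on UTF-32 to sidestep decoding, and its `estimated_width` covers only a small, hypothetical subset of double-width ranges (enough for the characters in the example); `format_left` is an illustrative name, not the STL's implementation:

```cpp
#include <cassert>
#include <string>

// Simplified width estimate: a few East Asian ranges count as two columns,
// everything else as one. Assumed for illustration only.
constexpr int estimated_width(char32_t cp) {
    return ((cp >= 0x3040 && cp <= 0xA4CF) || (cp >= 0xAC00 && cp <= 0xD7A3)) ? 2 : 1;
}

// Left-aligned string formatting: keep the longest prefix whose estimated
// width fits in `precision`, then pad with `fill` up to `width`.
inline std::u32string format_left(const std::u32string& value, int width, int precision, char32_t fill) {
    std::u32string out;
    int used = 0;
    for (const char32_t cp : value) {
        const int w = estimated_width(cp);
        if (used + w > precision) {
            break;
        }
        out += cp;
        used += w;
    }
    // Each fill character is counted as one column, whatever it would
    // actually occupy on screen.
    for (; used < width; ++used) {
        out += fill;
    }
    return out;
}
```

With precision 2 and width 3, the value is cut to one double-width character and exactly one fill character is appended, so a double-width fill makes the field four columns wide on screen.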

CaseyCarter · 2021-04-09T20:28:57Z

x86 is passing, but there are truncation warnings-as-errors when building the STL itself for x64.

This was due to my lazy attempt to avoid overflow checking by using ptrdiff_t for width estimations. Those responsible have been sacked.
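The follow-up commit is titled "Use int instead of ptrdiff_t for width estimation" with the note that it saturates to `INT_MAX`. One way such a clamp could look, as a hedged sketch with a hypothetical helper name and the assumption that both operands are non-negative:

```cpp
#include <cassert>
#include <climits>

// Add two non-negative width estimates, clamping at INT_MAX instead of
// overflowing (signed overflow would be undefined behavior).
constexpr int saturating_add(int lhs, int rhs) {
    return rhs > INT_MAX - lhs ? INT_MAX : lhs + rhs;
}
```

The comparison is arranged so that the overflow test itself never overflows.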

barcharcraz · 2021-04-09T19:29:14Z

stl/inc/format

+
+            const auto _Len = static_cast<size_t>(_Last - _First);
+
+            if (_Ch < 0b1110'0000) {


A possible future performance optimization would be branch prediction annotations.

I thought about this, but decided to avoid them. The vague "arbitrarily more/less likely" wording in the Standard means you don't know if a compiler will reorder branches a bit, or move unlikely branches off into the wilderness with cold exception-throwing code. Something like the latter can be catastrophic to performance when you really are just making a guess about the distribution of inputs. I'm happy with ordering the branches by what we feel is decreasing frequency and leaving anything further to folks running PGO with (hopefully) representative inputs.

stl/inc/format

barcharcraz

Looks good after those minor nodiscard changes.

stl/inc/chrono

stl/inc/format

Properly handle multibyte encodings in format strings

36cbbe7

Fixes microsoft#1576.

CaseyCarter added the cxx20 (C++20 feature) and format (C++20/23 format) labels Apr 9, 2021

CaseyCarter requested a review from a team as a code owner April 9, 2021 03:34

CaseyCarter marked this pull request as draft April 9, 2021 03:57

CaseyCarter added 3 commits April 8, 2021 21:02

Merge remote-tracking branch 'origin/feature/format' into gh1576

433acae

Casey's review comments

2957cb9

one more comment cleanup

fc38447

CaseyCarter marked this pull request as ready for review April 9, 2021 06:16

StephanTLavavej reviewed Apr 9, 2021

miscco reviewed Apr 9, 2021

Remove constexpr from P0355R7_calendars_and_time_zones_formatting.

2e05f65

barcharcraz changed the title ~~Properly handle multibyte encodings in format strings~~ <format> Properly handle multibyte encodings in format strings Apr 9, 2021

barcharcraz reviewed Apr 9, 2021

statementreply reviewed Apr 9, 2021

Workaround VSO-1308657 "Standard Library Header Units: std::projected…

2eec644

…::operator*() error LNK2019: unresolved external symbol".

Use int instead of ptrdiff_t for width estimation

04d4497

... and simply saturate to `INT_MAX`.

Sundry review comments

e52dde5

CaseyCarter commented Apr 9, 2021

test update

da5378d

barcharcraz reviewed Apr 9, 2021

barcharcraz reviewed Apr 9, 2021

barcharcraz approved these changes Apr 9, 2021

StephanTLavavej approved these changes Apr 9, 2021

MOAR comments

92e0bca

StephanTLavavej approved these changes Apr 9, 2021

CaseyCarter merged commit e5e024a into microsoft:feature/format Apr 9, 2021

CaseyCarter deleted the gh1576 branch April 9, 2021 23:33

CaseyCarter mentioned this pull request Apr 10, 2021

<format>: Format string parsing needs to be charset aware #1576

Closed

statementreply mentioned this pull request Apr 11, 2021

<format>: Avoid unnecessary calls to _Getcvt() #1825

Closed

Flamefire mentioned this pull request May 23, 2022

CodeCvt in locales don't work on Windows with multi-byte wide-chars boostorg/locale#71

Closed

frederick-vs-ja mentioned this pull request May 30, 2025

<xutility>: Usage of decltype(auto) in the operator() of ranges::iter_move's type seems buggy #5555

Closed
