Skip to content

Update line_char to tolerate <regex> round-trip conversions#478

Closed
StephanTLavavej wants to merge 2 commits intobuild2:masterfrom
StephanTLavavej:better-regex
Closed

Update line_char to tolerate <regex> round-trip conversions#478
StephanTLavavej wants to merge 2 commits intobuild2:masterfrom
StephanTLavavej:better-regex

Conversation

@StephanTLavavej
Copy link
Contributor

In MSVC's STL, we recently merged microsoft/STL#5592 to remove non-Standard _Uelem machinery from the <regex> parser. This broke build2, which we regularly build in our "Real World Code" test suite with development versions of MSVC.

The issue is that build2's use of a custom character type leads to an extremely brittle dependence on the internal implementation of <regex>, as indicated by the numerous comments in your codebase about the details of what MSVC's STL and libstdc++ happened to be doing. While parsing a regex, we need to compare its characters against escape sequences like R"(\bcats\b)" (word boundary assertions, for example). Our recent PR did this by constructing the user-defined character type from chars like 'b'. This is rejected by build2's assertions.

While I believe that build2 should strongly consider refactoring this entire code, I've created an additional MSVC STL PR that adjusts our comparisons to avoid requesting larger changes from build2: microsoft/STL#5675. Instead of constructing user-defined character types from char (and then comparing homogeneously), we'll directly compare against char. This is less theoretically pure, but uses build2's heterogeneous comparison operators, avoiding most problems.

However, we still need to detect whether user-defined character types can round-trip through char (to avoid being confused by truncation). This PR here is a small change to support that. In regex.hxx, it applies what operator char () is already doing, to operator T (). This allows all user-defined characters to be implicitly converted, returning \a (BELL) for your "non-special" type. (MSVC's STL needs this now because we sometimes convert to types other than char specifically, as part of our _Uelem removal/overhaul.) Then in regex.cxx, I'm adjusting line_char's constructor to accept this \a (BELL) character, to accept round-tripping without asserting.

This appears to pass build2's tests locally on my machine, with the updated MSVC STL.

@boris-kolpackov
Copy link
Member

Thanks for the PR and the detailed description of what's going on. Will look into it shortly and get back to you.

I also agree that what we do here is unfortunate but at the moment we don't see a better way to provide the same functionality (line-level regex) without heroic measures (like writing our own regex implementation or forking libc++'s one and using it everywhere). So we will limp along for the time being and I appreciate you willing to entertain a workaround for us in MSVC's STL.

@boris-kolpackov
Copy link
Member

Ok, this has been merged with a few tweaks as 68f80c8

The only tweak other than comments is to make the change in the line_char constructor only applicable to MSVC's STL (we want to detect and analyze similar changes in other implementations).

While the change to operator T() is uncontroversial, I am a bit fuzzy on the implications for the line_char constructor change. Specifically, the conversion chain line_char -> T -> line_char will change the line_char type from not special to special, which means it will unlikely to ever compare equal to any char. But I assume that's what you want, right?

We've also CI'ed it on the current MSVC's STL version as well as all the other STL implementations we support and there are no issues.

Let me know if you are happy with the end result and I will close this PR. I can also stage this change if you would like to try the "official" distribution (and maybe replace the old version in your "Real World Code" test suite).

@muellerj2
Copy link

Specifically, the conversion chain line_char -> T -> line_char will change the line_char type from not special to special, which means it will unlikely to ever compare equal to any char. But I assume that's what you want, right?

Yeah, this is the point of these casts. They are used used to check whether the value of a character type can be represented by an integer of type T, and the regex engine uses this information for various purposes (e.g., to decide whether a character can be stored in a bitmap,).

The non-special line_chars can't be represented in integers, so these conversion chains not comparing equal is exactly what should happen.

@StephanTLavavej
Copy link
Contributor Author

Let me know if you are happy with the end result and I will close this PR.

Yes, this is great, thanks! I've verified that your merged changes work with my updated STL.

I can also stage this change if you would like to try the "official" distribution (and maybe replace the old version in your "Real World Code" test suite).

Yes, that would be very helpful for our test team.

@boris-kolpackov
Copy link
Member

Yes, that would be very helpful for our test team.

Ok, here you go: https://stage.build2.org/0/0.18.0-a.0/

build2-toolchain-0.18.0-a.0.20250814065000.98b4d859b461 and all further versions will have this fix.

Thanks again for your help on this.

@StephanTLavavej StephanTLavavej deleted the better-regex branch August 16, 2025 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants