Description
From MSDN:
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
This is by far one of the biggest gotchas with using the .NET Regex class.
I suggest adding a RegexOptions.AnyNewLine option that treats $ as matching both the Windows Environment.NewLine and the Unix Environment.NewLine, regardless of the environment running corefx.
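The behavior described in the MSDN quote can be reproduced with a short sketch. The helper name `MatchValues` and the sample strings are made up for illustration, and the workaround is written as a lookahead, `(?=\r?$)`, so that the `\r` stays out of the match value. The code uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: collects every match value under RegexOptions.Multiline.
static string[] MatchValues(string pattern, string input) =>
    Regex.Matches(input, pattern, RegexOptions.Multiline)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string unixText = "one\ntwo";
string windowsText = "one\r\ntwo";

// Unix line endings: "$" matches before "\n" and at the end of the input.
string[] unixPlain = MatchValues(@"\w+$", unixText);

// Windows line endings: "\r" sits between "one" and the "$" position,
// so only the last line matches.
string[] windowsPlain = MatchValues(@"\w+$", windowsText);

// Documented workaround, here as a lookahead so "\r" is not captured.
string[] windowsWorkaround = MatchValues(@"\w+(?=\r?$)", windowsText);

Console.WriteLine(string.Join(",", unixPlain));          // one,two
Console.WriteLine(string.Join(",", windowsPlain));       // two
Console.WriteLine(string.Join(",", windowsWorkaround));  // one,two
```

Using the documented `\r?$` form directly (`\w+\r?$`) also matches both lines, but the first match value then includes the trailing `\r`.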
Portability concerns
According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.
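Until such an option exists, the lookahead quoted above can be spliced into a pattern by hand. The following is a minimal sketch of that idea, not the proposed implementation; `anyNewLineEnd` and `MatchValues` are hypothetical names, and the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The lookahead quoted in the proposal, standing in for "$".
string anyNewLineEnd = @"(?=[\r\n]|\z)";

// Hypothetical helper: match values for a pattern, with no RegexOptions set.
static string[] MatchValues(string pattern, string input) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] inputs = { "one\ntwo", "one\r\ntwo", "one\rtwo" };
foreach (string input in inputs)
{
    // Every line-ending style yields both words.
    Console.WriteLine(string.Join(",", MatchValues(@"\w+" + anyNewLineEnd, input)));  // one,two
}

// Caveat: used on its own, the lookahead also succeeds between "\r" and "\n".
int zeroWidthHits = Regex.Matches("a\r\nb", anyNewLineEnd).Count;
Console.WriteLine(zeroWidthHits);  // 3: before "\r", before "\n", and at the end
```

The last two lines show a property of this particular lookahead: on a `\r\n` pair it reports two end-of-line positions, which matters for patterns that can match the empty string.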
The advice written in the current docs is actually not portable to Unix, which is becoming a more popular option. As suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.
Backward compatibility concerns
Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on a RegexOptions enum variable. I really can't envision anyone doing this on purpose.
Here is Petr Onderka (@svick)'s summary:
| OS | Line-ending style | Current | Environment.NewLine | AnyNewLine |
|---|---|---|---|---|
| Windows | Windows | ✗ | ✓ | ✓ |
| Windows | Unix | ✓ | ✗ | ✓ |
| Unix | Windows | ✗ | ✗ | ✓ |
| Unix | Unix | ✓ | ✓ | ✓ |
API Proposal
edit by @ViktorHofer
```diff
 namespace System.Text.RegularExpressions
 {
     [Flags]
     public enum RegexOptions
     {
         None = 0x0000,
         IgnoreCase = 0x0001, // "i"
         Multiline = 0x0002, // "m"
         ExplicitCapture = 0x0004, // "n"
         Compiled = 0x0008, // "c"
         Singleline = 0x0010, // "s"
         IgnorePatternWhitespace = 0x0020, // "x"
         RightToLeft = 0x0040, // "r"
 #if DEBUG
         Debug = 0x0080, // "d"
 #endif
         ECMAScript = 0x0100, // "e"
         CultureInvariant = 0x0200,
+        AnyNewLine = 0x0400 // Treat "$" as (?=[\r\n]|\z)
     }
 }
```
API Review Notes
Looks good. A few comments:
- We cannot use the proposed value of 128 because it's already taken (see #if DBG in code).
  - Spec updated so that AnyNewLine = 0x400 (1024).
- The table looks wrong (Windows on Windows on the Current should work IMHO).
  - The table is correct. The fact this trips up experts just speaks to why this is a profound GOTCHA in the Core SDK.
- Maybe AcceptAllLineEndings?
  - Some hallway testing I've done indicates AnyNewLine is a good name. Plus, (argument after the final name was chosen) this enumeration will be transliterated into a checkbox on Regular Expression Visualization tools like Regex Hero, so it is preferable to have a concise explanation for the feature to avoid excessive screen space.
PR Review Notes
After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning of the \Z anchor. @jzabroski suggested writing out how the end-user documentation will look after this change, as good docs will determine whether it is a step-function improvement in usability and in reducing gotchas.
Also, during the PR, it seems @shishirchawla proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which would alter the meaning of the '\Z' anchor in addition to altering the meaning of the '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.
AnyNewLine as Anchor Modifier to \Z and $ Anchors
| flags | $ is treated as | $ documentation | \Z is treated as | \Z documentation |
|---|---|---|---|---|
| neither | (?=\n\z\|\z) | The match must occur at the end of the string or before \n at the end of the string. | (Same as $ with this option.) | (Same as $ with this option.) |
| RegexOptions.Multiline | (?=\n\|\n\z\|\z) | The match must occur at the end of the string or before \n anywhere in the string. | (?=\n\z\|\z) | The match must occur at the end of the string or before \n at the end of the string. |
| RegexOptions.Multiline \| RegexOptions.AnyNewLine | (?=\r\n\|\r\|\n\|\r\n\z\|\r\z\|\n\z\|\z) | The match must occur at the end of the string or before \r\n, \n or \r anywhere in the string. | (?=\r\n\z\|\r\z\|\n\z\|\z) | The match must occur at the end of the string or before \r\n, \n or \r at the end of the string. |
| RegexOptions.AnyNewLine | (?=\r\n\z\|\r\z\|\n\z\|\z) | The match must occur at the end of the string or before \r\n, \n or \r at the end of the string. | (Same as $ with this option.) | (Same as $ with this option.) |
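The "is treated as" columns can be checked against the anchors that exist today. The sketch below compares the built-in `$` and `\Z` with the lookaheads from the first two rows, then evaluates the AnyNewLine lookaheads literally as written in the table, since the option itself is only a proposal. `Positions` is a hypothetical helper, the sample strings are made up, and the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: indices at which a zero-width pattern matches.
static int[] Positions(string pattern, string input, RegexOptions options = RegexOptions.None) =>
    Regex.Matches(input, pattern, options).Cast<Match>().Select(m => m.Index).ToArray();

string unixText = "a\nb\n";

// Row "neither": "$" and "\Z" versus (?=\n\z|\z).
int[] dollar = Positions(@"$", unixText);
int[] bigZ = Positions(@"\Z", unixText);
int[] dollarSpelled = Positions(@"(?=\n\z|\z)", unixText);

// Row "RegexOptions.Multiline": "$" versus (?=\n|\n\z|\z).
int[] dollarMultiline = Positions(@"$", unixText, RegexOptions.Multiline);
int[] dollarMultilineSpelled = Positions(@"(?=\n|\n\z|\z)", unixText);

// AnyNewLine rows: only the lookaheads can be run, as the option is a proposal.
string windowsText = "a\r\nb\r\n";
int[] anyNewLine = Positions(@"(?=\r\n\z|\r\z|\n\z|\z)", windowsText);
int[] anyNewLineMultiline = Positions(@"(?=\r\n|\r|\n|\r\n\z|\r\z|\n\z|\z)", windowsText);

Console.WriteLine(string.Join(",", dollar));                  // 3,4
Console.WriteLine(string.Join(",", dollarSpelled));           // 3,4
Console.WriteLine(string.Join(",", dollarMultiline));         // 1,3,4
Console.WriteLine(string.Join(",", dollarMultilineSpelled));  // 1,3,4
Console.WriteLine(string.Join(",", anyNewLine));              // 4,5,6
Console.WriteLine(string.Join(",", anyNewLineMultiline));     // 1,2,4,5,6
```

Read literally, the AnyNewLine lookaheads also succeed at the position between `\r` and `\n` (indices 2 and 5 above). Whether an implementation should report that position is a design question this sketch does not settle.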