Description
It turns out that ripgrep's -w/--word-regexp flag doesn't quite do what it claims to. As a result, its output differs from GNU grep in some cases, and even from ripgrep's own PCRE2 engine (-P):
$ echo '###' | grep -w -o .
#
#
#
$ echo '###' | rg-13.0.0 -w -o .
#
#
$ echo '###' | rg-13.0.0 -P -w -o .
#
#
#
ripgrep 14 will fix this:
$ echo '###' | rg-14.0.0 -w -o .
#
#
#
The actual issue here is that ripgrep used a hacky work-around to implement -w/--word-regexp that just didn't work right in all cases. For PCRE2, it was implemented via look-around assertions that always got it right:
ripgrep/crates/pcre2/src/matcher.rs
Lines 64 to 69 in 52731cd
} else if self.word {
    // We make this option exclusive with whole_line because when
    // whole_line is enabled, all matches necessary fall on word
    // boundaries. So this extra goop is strictly redundant.
    singlepat = format!(r"(?<!\w)(?:{})(?!\w)", singlepat);
}
The work-around I used basically tried to emulate look-around with capture groups. And while it works in a lot of cases, it doesn't work in all of them. I spent quite a bit of time trying to figure out how to fix the work-around once and for all, but couldn't see a way to make the code obviously correct. Instead, I just added support for \b{start-half} and \b{end-half} word boundary assertions that do exactly what the look-around does in PCRE2. See: rust-lang/regex#469
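To make the semantics concrete, here is a minimal std-only sketch of what the two half assertions check, applied to the `-w -o .` example above. It is not the regex crate's implementation. The helper names (`is_word_char`, `start_half`, `end_half`, `word_dot_matches`) are hypothetical, and `is_word_char` is a simplification of Unicode `\w`.

```rust
/// Simplified stand-in for `\w` (assumption: the real Unicode `\w` class
/// also covers marks and other connector punctuation).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Emulates `\b{start-half}` (like `(?<!\w)`) at byte offset `at`:
/// true when there is no word character immediately before `at`.
fn start_half(haystack: &str, at: usize) -> bool {
    haystack[..at].chars().next_back().map_or(true, |c| !is_word_char(c))
}

/// Emulates `\b{end-half}` (like `(?!\w)`) at byte offset `at`:
/// true when there is no word character immediately after `at`.
fn end_half(haystack: &str, at: usize) -> bool {
    haystack[at..].chars().next().map_or(true, |c| !is_word_char(c))
}

/// Collects every match of `.` (one non-newline character) that satisfies
/// both half assertions, roughly what `-w -o .` should print.
fn word_dot_matches(haystack: &str) -> Vec<&str> {
    let mut out = Vec::new();
    for (start, c) in haystack.char_indices() {
        if c == '\n' {
            continue; // `.` does not match a line terminator by default
        }
        let end = start + c.len_utf8();
        if start_half(haystack, start) && end_half(haystack, end) {
            out.push(&haystack[start..end]);
        }
    }
    out
}

fn main() {
    // Every `#` is a non-word character with no word character on either
    // side, so all three positions match.
    println!("{:?}", word_dot_matches("###")); // prints ["#", "#", "#"]
    // The `#` is preceded by the word character `a`, so it is rejected.
    println!("{:?}", word_dot_matches("a#b")); // prints ["a", "b"]
}
```

Unlike `\b`, neither check requires a word character on the other side of the position, which is why a pattern that matches only non-word characters can still satisfy both.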