-w/--word-regexp sometimes returns odd results #2623

@BurntSushi

It turns out that ripgrep's -w/--word-regexp flag doesn't quite do what it claims to. As a result, its output differs from GNU grep in some cases, and even from ripgrep itself when PCRE2 is enabled via -P:

$ echo '###' | grep -w -o .
#
#
#
$ echo '###' | rg-13.0.0 -w -o .
#
#
$ echo '###' | rg-13.0.0 -P -w -o .
#
#
#

ripgrep 14 will fix this:

$ echo '###' | rg-14.0.0 -w -o .
#
#
#

The actual issue here is that ripgrep's default regex engine, which doesn't support look-around, used a hacky work-around to implement -w/--word-regexp, and that work-around just didn't work right in all cases. For PCRE2, the flag was implemented via look-around assertions, which always got it right:

    } else if self.word {
        // We make this option exclusive with whole_line because when
        // whole_line is enabled, all matches necessarily fall on word
        // boundaries. So this extra goop is strictly redundant.
        singlepat = format!(r"(?<!\w)(?:{})(?!\w)", singlepat);
    }

The work-around I used basically tried to emulate look-around with capture groups. And while it works in a lot of cases, it doesn't work in all of them. I spent quite a bit of time trying to figure out how to fix the work-around once and for all, but couldn't see a way to make the code obviously correct. Instead, I just added support for \b{start-half} and \b{end-half} word boundary assertions that do exactly what the look-around does in PCRE2. See: rust-lang/regex#469
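
To make the semantics concrete, here is a minimal sketch of what the two half-boundary assertions check, written with only the standard library. It is not ripgrep's or the regex crate's implementation. The helper names (is_word_char, start_half, end_half, word_matches_of_dot) are hypothetical, and is_word_char only approximates \w as "alphanumeric or underscore". The sketch hard-codes the pattern `.` from the example above.

```rust
/// Rough stand-in for the regex `\w` class.
/// Assumption: alphanumeric or underscore; the real class is somewhat larger.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// The `(?<!\w)` condition: the character just before byte offset `at`
/// is not a word character, or there is no such character.
fn start_half(haystack: &str, at: usize) -> bool {
    haystack[..at]
        .chars()
        .next_back()
        .map_or(true, |c| !is_word_char(c))
}

/// The `(?!\w)` condition: the character starting at byte offset `at`
/// is not a word character, or there is no such character.
fn end_half(haystack: &str, at: usize) -> bool {
    haystack[at..]
        .chars()
        .next()
        .map_or(true, |c| !is_word_char(c))
}

/// Collects every single-character match of `.` that satisfies both
/// half-boundary conditions, i.e. what `-w -o .` is expected to print.
fn word_matches_of_dot(haystack: &str) -> Vec<&str> {
    let mut out = Vec::new();
    for (start, c) in haystack.char_indices() {
        // `.` does not match a line terminator.
        if c == '\n' {
            continue;
        }
        let end = start + c.len_utf8();
        if start_half(haystack, start) && end_half(haystack, end) {
            out.push(&haystack[start..end]);
        }
    }
    out
}
```

Under these assumptions, word_matches_of_dot("###") returns all three # characters, since no # has a word character on either side. For "abc" it returns nothing, because every character touches another word character.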

Labels: bug, rollup