regex and regex_lite disagree on \b behavior on Unicode accent character #1356

@orlp

What version of regex are you using?

regex == 1.12.3
regex_lite == 0.1.9

Describe the bug at a high level.

The Unicode character U+0595 (HEBREW ACCENT ZAQEF GADOL, a combining accent mark) is treated differently by \b in the two engines: regex reports word boundaries around it, while regex_lite reports none.

What are the steps to reproduce the behavior?

use regex::Regex;
use regex_lite::Regex as LRegex; // assumed alias for the regex_lite engine
dbg!(Regex::new("\\b").unwrap().find_iter("\u{595}").collect::<Vec<_>>());
dbg!(LRegex::new("\\b").unwrap().find_iter("\u{595}").collect::<Vec<_>>());

What is the actual behavior?

[src/main.rs:15:5] Regex::new("\\b").unwrap().find_iter("\u{595}").collect::<Vec<_>>() = [
    Match {
        start: 0,
        end: 0,
        string: "",
    },
    Match {
        start: 2,
        end: 2,
        string: "",
    },
]
[src/main.rs:16:5] LRegex::new("\\b").unwrap().find_iter("\u{595}").collect::<Vec<_>>() = []

What is the expected behavior?

I don't know which result is correct, but I would expect both engines to behave the same. I found this through fuzzing.
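
One possible explanation, not confirmed in this report, is that the two engines use different definitions of "word character" when evaluating \b: one that counts Unicode marks as word characters, and one that only considers ASCII. The sketch below uses only the standard library and does not call either crate. The functions `word_boundaries`, `is_ascii_word` and `is_unicode_word_sketch` are hypothetical names introduced for illustration, and the hard-coded Hebrew accent range is an assumption standing in for a full Unicode mark table.

```rust
/// Returns the byte offsets at which `\b` would match in `haystack`,
/// given a caller-supplied definition of "word character".
fn word_boundaries(haystack: &str, is_word: impl Fn(char) -> bool) -> Vec<usize> {
    let mut offsets = Vec::new();
    // The position before the start of the input counts as non-word.
    let mut prev_is_word = false;
    for (i, c) in haystack.char_indices() {
        let cur_is_word = is_word(c);
        // A boundary sits wherever the word/non-word status flips.
        if cur_is_word != prev_is_word {
            offsets.push(i);
        }
        prev_is_word = cur_is_word;
    }
    // The position after the end of the input counts as non-word too.
    if prev_is_word {
        offsets.push(haystack.len());
    }
    offsets
}

/// ASCII-only word characters: [0-9A-Za-z_].
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Hypothetical stand-in for a Unicode-aware word class.
/// Assumption: the Hebrew accents U+0591..=U+05AF are treated as
/// word characters here to imitate a class that includes marks.
fn is_unicode_word_sketch(c: char) -> bool {
    c.is_alphanumeric() || c == '_' || ('\u{0591}'..='\u{05AF}').contains(&c)
}

fn main() {
    let haystack = "\u{595}";
    // ASCII-only definition: no boundaries.
    println!("{:?}", word_boundaries(haystack, is_ascii_word));
    // Mark-aware definition: boundaries at byte offsets 0 and 2.
    println!("{:?}", word_boundaries(haystack, is_unicode_word_sketch));
}
```

Under these assumptions the ASCII-only predicate yields no boundaries and the mark-aware predicate yields offsets 0 and 2, which matches the shape of the two results reported above. Whether this is what actually happens inside either engine would need to be checked against their documentation.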
