bytes_regex should permit generation of byte sequences that are invalid UTF-8 #336

@zackw

This strategy function

fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
    prop::string::bytes_regex(
        r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)"
    ).unwrap()
}

is intended to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly from data on disk that cannot be trusted. What it actually does is crash on the unwrap() with

thread 'attrs::test::parse_xattr_ts_invalid' panicked at 'called `Result::unwrap()` on an `Err` value:
    RegexSyntax(Translate(Error {
        kind: InvalidUtf8,
        pattern: "(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\\.[0-9]*[^0-9].*)",
        span: Span(Position(o: 7, l: 1, c: 8), Position(o: 13, l: 1, c: 14))
    }))'

Looking at the code, I believe the change needed is for bytes_regex to have its own version of regex_to_hir that calls ParserBuilder::allow_invalid_utf8(true).

This change should have no visible effect on any existing code that uses bytes_regex. Generating invalid UTF-8 also requires opting in within the regex itself (that's one of the things the (?s-u: does), and any existing regex that uses that flag must be using it in a way that can't actually generate invalid UTF-8, or else it would hit the same crash I'm hitting.
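
To make the intent of the pattern above concrete, here is a minimal, standard-library-only sketch of what its four alternatives accept when read as a byte-oriented regex. The function name is_invalid_ts is hypothetical and is not part of proptest or regex-syntax; it is a hand-written approximation of the pattern, assuming that with Unicode mode disabled [^0-9] and . match any single byte, including bytes 0x80 through 0xFF.

```rust
/// Hypothetical hand-written equivalent of the pattern
/// (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
/// applied to a whole byte string. Not part of any library.
fn is_invalid_ts(input: &[u8]) -> bool {
    // Alternative 1: the empty string.
    if input.is_empty() {
        return true;
    }
    // Length of the leading run of ASCII digits.
    let int_len = input.iter().take_while(|b| b.is_ascii_digit()).count();
    // Alternative 2: the first byte is not a digit.
    if int_len == 0 {
        return true;
    }
    match input.get(int_len) {
        // Only digits: a well-formed integer, so the pattern does not match.
        None => false,
        // Alternative 4: digits, '.', optional digits, then a non-digit byte.
        Some(b'.') => {
            let rest = &input[int_len + 1..];
            let frac_len = rest.iter().take_while(|b| b.is_ascii_digit()).count();
            frac_len < rest.len()
        }
        // Alternative 3: digits followed by a byte that is neither digit nor '.'.
        Some(_) => true,
    }
}
```

Under this reading, inputs such as [0xFF] or b"1\xC3" are accepted by the pattern yet rejected by std::str::from_utf8, which is exactly the kind of input the strategy is meant to produce.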
