`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

This strategy function

```rust
fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
    prop::string::bytes_regex(
        r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)"
    ).unwrap()
}
```

is _intended_ to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly on untrusted data read from disk. What it actually _does_ is panic at the `unwrap()` with:

```none
thread 'attrs::test::parse_xattr_ts_invalid' panicked at 'called `Result::unwrap()` on an `Err` value:
    RegexSyntax(Translate(Error {
        kind: InvalidUtf8,
        pattern: "(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\\.[0-9]*[^0-9].*)",
        span: Span(Position(o: 7, l: 1, c: 8), Position(o: 13, l: 1, c: 14))
    }))'
```

Looking at the code, I _believe_ the change needed is for `bytes_regex` to have its own version of `regex_to_hir` that calls [`ParserBuilder::allow_invalid_utf8(true)`](https://docs.rs/regex-syntax/0.6.28/regex_syntax/struct.ParserBuilder.html#method.allow_invalid_utf8).

This change should have no visible effect on existing code that uses `bytes_regex`. Generating invalid UTF-8 also requires opting in within the regex itself (that's one of the things the `(?s-u:` does), and any existing regex that uses that flag must be using it in a way that _can't_ match invalid UTF-8, or else it would hit the same error I'm hitting.
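To make the target inputs concrete, here is a std-only sketch of what the pattern above accepts when read byte-wise: everything that is _not_ of the form `[0-9]+(\.[0-9]*)?`. The helper name `is_invalid_ts` is hypothetical and is not part of proptest or of the parser under test; it only mirrors the regex by hand to show that some matching inputs are not valid UTF-8.

```rust
/// Hypothetical helper (not part of proptest): returns true when `bytes`
/// is NOT of the form `[0-9]+(\.[0-9]*)?`, i.e. when it falls into one of
/// the alternatives of the regex above, interpreted byte-wise.
fn is_invalid_ts(bytes: &[u8]) -> bool {
    let mut i = 0;
    // Leading run of ASCII digits: `[0-9]+`.
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    if i == 0 {
        // Empty input, or the first byte is not a digit: `|[^0-9].*`.
        return true;
    }
    if i == bytes.len() {
        // Digits only: a well-formed value, so the regex does not match.
        return false;
    }
    if bytes[i] != b'.' {
        // Digits followed by a byte that is neither digit nor dot:
        // `[0-9]+[^0-9.].*`.
        return true;
    }
    i += 1;
    // Fractional digits: `[0-9]*`.
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    // Any leftover byte matches `[0-9]+\.[0-9]*[^0-9].*`.
    i < bytes.len()
}

fn main() {
    // 0xFF never appears in valid UTF-8, yet this input matches
    // the `[0-9]+[^0-9.].*` alternative.
    let sample: &[u8] = b"12\xff";
    assert!(is_invalid_ts(sample));
    assert!(std::str::from_utf8(sample).is_err());
    // A well-formed value is rejected by the pattern.
    assert!(!is_invalid_ts(b"1234.5"));
}
```

Inputs like `b"12\xff"` are exactly the ones the strategy cannot currently produce, since the pattern is rejected before any generation happens.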
