regexp_match skips first match when returning match #3803

@Jefffrey

Describe the bug

In some cases regexp_match skips the first and only match.

For example, if the pattern is foo and the string to match is foo, it should return the single match foo. It currently returns an empty array for that row: it correctly detects that there is a match, but does not return the matched text.

To Reproduce

Example test in arrow-string/src/regexp.rs

    #[test]
    fn sandbox() {
        let array = StringArray::from(vec![Some("foo")]);
        let pattern = GenericStringArray::<i32>::from(vec![r"foo"]);
        let actual = regexp_match(&array, &pattern, None).unwrap();
        let result = actual.as_any().downcast_ref::<ListArray>().unwrap();
        let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new();
        let mut expected_builder = ListBuilder::new(elem_builder);
        expected_builder.values().append_value("foo");
        expected_builder.append(true);
        let expected = expected_builder.finish();
        assert_eq!(&expected, result);
    }

This panics with:

thread 'regexp::tests::sandbox' panicked at 'assertion failed: `(left == right)`
  left: `ListArray
[
  StringArray
[
  "foo",
],
]`,
 right: `ListArray
[
  StringArray
[
],
]`', arrow-string/src/regexp.rs:277:9

The right (actual) value contains an empty StringArray [], whereas the left (expected) value contains the match: StringArray ["foo"].

Expected behavior

The test should succeed: regexp_match should return foo as the single match.

Additional context

This seems to be because the code always skips the first entry (capture group 0) when iterating over the captures:

    match re.captures(value) {
        Some(caps) => {
            for m in caps.iter().skip(1).flatten() {
                list_builder.values().append_value(m.as_str());
            }
            list_builder.append(true);
        }
        None => list_builder.append(false),
    }

In the test example above, caps has the value:

[arrow-string/src/regexp.rs:212] &caps = Captures(
    {
        0: Some(
            "foo",
        ),
    },
)

Relevant regex doc: https://docs.rs/regex/latest/regex/struct.Regex.html#method.captures

Specifically:

Capture group 0 always corresponds to the entire match.
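
The effect of the unconditional skip(1) can be sketched without arrow or the regex crate. The snippet below is a minimal model, not the library's code: it represents the captures as a plain slice in which index 0 is the whole match and later indices are explicit groups. The function names current_behaviour and with_fallback are hypothetical, and the fallback shown is only one possible fix (skip group 0 only when the pattern defines explicit groups), not necessarily the one the maintainers would choose.

```rust
/// Model of the current logic: always drop group 0 (the whole match),
/// mirroring `caps.iter().skip(1).flatten()` from the snippet above.
/// `caps[0]` is the whole match; `caps[1..]` are explicit capture groups,
/// with `None` for a group that did not participate in the match.
fn current_behaviour<'a>(caps: &[Option<&'a str>]) -> Vec<&'a str> {
    caps.iter().skip(1).flatten().copied().collect()
}

/// Hypothetical fix: skip group 0 only if the pattern has explicit groups.
/// When the pattern has no groups, group 0 is the only entry, so keep it.
fn with_fallback<'a>(caps: &[Option<&'a str>]) -> Vec<&'a str> {
    let skip = if caps.len() > 1 { 1 } else { 0 };
    caps.iter().skip(skip).flatten().copied().collect()
}
```

In this model, the captures from the failing test are the one-element slice [Some("foo")]: current_behaviour returns an empty vector, matching the empty StringArray in the panic output, while with_fallback returns the whole match. For a pattern with explicit groups both functions return the same result, so existing behaviour for grouped patterns is unchanged.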

Original issue: apache/datafusion#5479

Labels: arrow (Changes to the arrow crate), bug
