Skip to content

Conversation

@maartenbreddels
Copy link
Contributor

The second commit adds re2 to the linked libraries. @xhochy how should this be done, should I open a separate issue for this?

@github-actions
Copy link

@maartenbreddels
Copy link
Contributor Author

should re2 be available on all platform? appveyor for instance gives: 're2/re2.h': No such file or directory

@xhochy
Copy link
Member

xhochy commented Nov 24, 2020

@maartenbreddels Can you rebase this? That would be helpful for me setting up re2 in the toolchain.

@maartenbreddels
Copy link
Contributor Author

I cleaned it up a bit, but it might fail running the test, hope this is enough for you for now.

@maartenbreddels
Copy link
Contributor Author

There are two issues with this PR:

empty slices

In #8728 https://github.com/apache/arrow/pull/8728/files#diff-7771ecc138c4ecc2dc8498affe04f5b7f182c4b77b18a512c3dd07a82d45aa3dR116 was added, which adds an empty slice to the test, which breaks my test (which is good). However, my kernel will not be called, meaning i cannot set the output's type (which depends on the input regex). Shall I open a JIRA for that?

The error is:

# Array types differed: struct<letter: string, digit: string> vs struct<>
Null counts differ. Expected -1 but was 0
Expected:
  -- is_valid: all not null
  -- child 0 type: string
    []
  -- child 1 type: string
    []
Actual:
  -- is_valid: all not null

(showing the struct mismatch)

ChunkedArray fails

Commenting out the empty slice test, the next failure is with

AssertDatumsEqual(std::make_shared<ChunkedArray>(expected_chunks), out);

Failed
Got: 
  [
    -- is_valid: all not null
    -- child 0 type: string
      [
        "a"
      ]
    -- child 1 type: string
      [
        "1"
      ],
    -- is_valid:
          [
        true,
        false,
        false
      ]
    -- child 0 type: string
      [
        "b",
        "",
        ""
      ]
    -- child 1 type: string
      [
        "2",
        "",
        ""
      ]
  ]
Expected: 
  [
    -- is_valid: all not null
    -- child 0 type: string
      [
        "a"
      ]
    -- child 1 type: string
      [
        "1"
      ],
    -- is_valid:
          [
        true,
        false,
        false
      ]
    -- child 0 type: string
      [
        "b",
        "",
        ""
      ]
    -- child 1 type: string
      [
        "2",
        "",
        ""
      ]
  ]

I fail to see a difference in this output (maybe a null value in a child?), and this is difficult to debug in gdb. Maybe an opportunity to improve the printing of differences (since I fail to see any).

@jorisvandenbossche
Copy link
Member

@maartenbreddels what's the status of this PR? Ready for review, or are there still open issues? (you listed some above at #8459 (comment))

@nealrichardson
Copy link
Member

This needs rebase now that #8468 has merged.

@maartenbreddels
Copy link
Contributor Author

Rebased, but there are two issues left:

  • null values comparison fail, no idea what the issue is there, quite difficult for me to debug
  • the compute kernel does not get called on empty input, so i skip that test for now. But I think kernels should have the option to opt-in at least, so we can return an empty struct (all its arrays of length 0), but with field names.

@pitrou
Copy link
Member

pitrou commented Apr 15, 2021

@maartenbreddels Will you need help here?

@maartenbreddels
Copy link
Contributor Author

Yes, that would be appriciated, the null values I find difficult to debug, and the second issue I'm not sure that is good solution.

@pitrou
Copy link
Member

pitrou commented Apr 20, 2021

Ok, I'm looking at this.

@pitrou
Copy link
Member

pitrou commented Apr 20, 2021

Naming nit @jorisvandenbossche @jonkeane @ianmcook .
Do you think "extract_regex" is the right term for this function? Something else?

@pitrou pitrou force-pushed the ARROW-10195 branch 2 times, most recently from f565700 to 9b1f6e0 Compare April 20, 2021 18:50
@pitrou
Copy link
Member

pitrou commented Apr 20, 2021

@maartenbreddels For the record:

null values comparison fail, no idea what the issue is there, quite difficult for me to debug

I am not sure what was causing this, but my small rewrite of the kernel fixed it.

I think kernels should have the option to opt-in at least, so we can return an empty struct (all its arrays of length 0), but with field names.

This was solved by defining a Resolver function so that the FunctionOptions instance can participate in output type resolution (instead of specifying an empty struct type when declaring the kernel).

@jorisvandenbossche
Copy link
Member

Do you think "extract_regex" is the right term for this function? Something else?

In pandas, this is called "extract" (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), so from that point of view this is a good name.
In re, I think one would use re.search, but that's basically both used for "match" as for "extract" purposes, so I think "extract" is a better name for the kernel in this PR.

The _regex might be a bit superfluous, as there is no variant without regex (but on the other hand it explicitly signals it is using regular expressions).

@jorisvandenbossche
Copy link
Member

One thing I noted while quickly trying it out:

In [6]: pc.extract_regex(pa.array(['a1', 'b2', 'c3']), regex=r'(?P<letter>[ab])?(?P<digit>\d)')
Out[6]: 
<pyarrow.lib.StructArray object at 0x7f8c1bdf9760>
-- is_valid: all not null
-- child 0 type: string
  [
    "a",
    "b",
    ""
  ]
-- child 1 type: string
  [
    "1",
    "2",
    "3"
  ]

When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child?

@pitrou
Copy link
Member

pitrou commented Apr 20, 2021

When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child?

Agreed, but unfortunately re2 doesn't give us that information...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: I would maybe call this "pattern" instead of "regex"

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, will merge

@pitrou pitrou closed this in 37c27d1 Apr 21, 2021
@maartenbreddels maartenbreddels deleted the ARROW-10195 branch April 21, 2021 14:05
@maartenbreddels
Copy link
Contributor Author

Thanks for finishing this Antoine!

This was solved by defining a Resolver function so that the FunctionOptions instance can participate in output type resolution (instead of specifying an empty struct type when declaring the kernel).

Good to know, thanks.

@pitrou
Copy link
Member

pitrou commented Apr 21, 2021

@maartenbreddels You're welcome :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants