Skip to content

Conversation

@maartenbreddels
Copy link
Contributor

@maartenbreddels maartenbreddels commented Oct 15, 2020

Two new kernels

  • replace_substring like Python's str.replace
  • replace_substring_re2 like Python's re.sub

@github-actions
Copy link

@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

You may want to remove the regex variant of this if you want to move this forward without depending on resolving the re2 dependency issue.

@jorisvandenbossche
Copy link
Member

@maartenbreddels would it be practical to split this into two PRs? (one for plain replace, other for re2-based replace) Or would you prefer first to have the re2 dependency issue resolved? (ARROW-10541)

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@maartenbreddels
Copy link
Contributor Author

I'd rather keep this 1 PR, looks like #8756 is working

@maartenbreddels maartenbreddels marked this pull request as ready for review November 25, 2020 16:12
@maartenbreddels
Copy link
Contributor Author

@pitrou this is ready for review, failure seems unrelated (minio on windows).

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! See comments below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReplaceString below is basically independent from Type, but using this idiom may compile it twice. Can you find another way to parametrize the kernel?
(hint: perhaps use composition rather than inheritance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I misunderstand, but via offset_type we are not independent of Type right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't understand why offset_type is being used here. ValueDataBuilder is basically a TypedBufferBuilder<uint8_t>, it's used for building the string data, it doesn't deal with string offsets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yes, I see it now. I'm using offset_type in this class as well, which I shouldn't, I think that's what led me to this. This requires a bit of refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern occurs more often in the file, I didn't realize this lead to slower compilation and probably larger binary sizes. I think it requires a refactor that is larger than this PR. Also, I won't have the time currently to do this. Can we merge this as is, and I'll open a Jira issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you don't need to refactor other kernels for now, but I suppose this one could easily be adapted, no? :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly as above, this looks basically independent from Type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the GlobalReplace loop works a bit differently, it calls Match then Rewrite, avoiding the duplicate matching calls. Not sure it's worth optimizing this, though:
https://github.com/google/re2/blob/master/re2/re2.cc#L427

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I prefer to keep it as it is, I left a comment in the code so this doesn't get lost.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 27, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 18, 2020
@pitrou
Copy link
Member

pitrou commented Jan 19, 2021

@maartenbreddels Is it ready for review again? Feel free to ping me.

@maartenbreddels
Copy link
Contributor Author

@pitrou Yes, apart from an unanswered question this is ready for review 👍

@maartenbreddels
Copy link
Contributor Author

@pitrou this is ready for review (assuming you agree with the above plan of doing a refactor later on)

@jorisvandenbossche
Copy link
Member

(gentle ping here, would really like to see those PRs merged for 4.0 in April!)

@pitrou
Copy link
Member

pitrou commented Mar 25, 2021

I'll rebase and update this PR.

@pitrou
Copy link
Member

pitrou commented Mar 25, 2021

@nealrichardson
Copy link
Member

Haven't seen that before, and we've been building with re2 for months now. Maybe this is the first time we're building something that actually uses re2? It's possible that re2 needs a "backport" library built with the rtools3.5 toolchain; an immediate workaround could be to build with -DRE2_SOURCE=BUNDLED here, or to make ARROW_WITH_RE2 conditional on the toolchain (like we do for ARROW_S3 already). Given that we won't have to support the rtools3.5 toolchain after April or May, I'll try just turning it off here.

@nealrichardson
Copy link
Member

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: 82bb60d

Submitted crossbow builds: ursacomputing/crossbow @ actions-233

Task Status
conda-linux-gcc-py36-cpu-r36 Azure
conda-linux-gcc-py37-cpu-r40 Azure
conda-osx-clang-py36-r36 Azure
conda-osx-clang-py37-r40 Azure
conda-win-vs2017-py36-r36 Azure
conda-win-vs2017-py37-r40 Azure
homebrew-r-autobrew Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-minimal-build Azure
test-r-rhub-ubuntu-gcc-release Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-ubuntu-18.04-r-sanitizer Azure

@nealrichardson
Copy link
Member

The as-cran failure is legit: https://github.com/ursacomputing/crossbow/runs/2197449739?check_suite_focus=true#step:7:210

It's a weird build setup that uses clang and -stdlib=libc++; unfortunately we have to support it because it's one of CRAN's checks. But I'll deal with this in a followup.

@nealrichardson
Copy link
Member

I made ARROW-12094 for fixing that build. Merging now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants