Conversation

@nealrichardson nealrichardson commented Oct 17, 2019

  • RecordBatch Filter
  • RecordBatch Take
  • ChunkedArray Filter
  • ChunkedArray Take
  • Table Filter
  • Table Take
  • Tests for ChunkedArray/Table Filter
  • lint etc.

@bkietz bkietz left a comment

This is a good start. I think you can reduce the number of public-facing overloads in favor of the Datum overloads.

bkietz (Member):

Instead of using Concatenate here, I think it'd be better to use std::vector<ArrayVector> RechunkArraysConsistently(const std::vector<ArrayVector>&); (defined in array.h). After that the chunks will be of equal length, suitable for filtering.
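A minimal sketch of the suggested approach, assuming RechunkArraysConsistently is the arrow::internal helper declared in array.h; the function name and signature below are made up for illustration. After the call, corresponding entries of the two chunk vectors have equal length, so each pair can be handed straight to the per-chunk filter kernel.

#include <utility>
#include <vector>

#include "arrow/array.h"  // arrow::internal::RechunkArraysConsistently
#include "arrow/table.h"  // arrow::ChunkedArray

void AlignChunksSketch(const arrow::ChunkedArray& column,
                       const arrow::ChunkedArray& filter,
                       arrow::ArrayVector* column_chunks,
                       arrow::ArrayVector* filter_chunks) {
  // Slice both inputs on the union of their chunk boundaries; this produces
  // zero-copy slices rather than concatenating values into a new allocation.
  std::vector<arrow::ArrayVector> rechunked =
      arrow::internal::RechunkArraysConsistently({column.chunks(), filter.chunks()});
  *column_chunks = std::move(rechunked[0]);
  *filter_chunks = std::move(rechunked[1]);
}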

nealrichardson (Member Author):

Thanks, I'll look into that.

bkietz (Member):

This can be addressed later, but there's an unfortunate missed optimization here: since we're reusing the same filter for each column, we don't need to recount the set bits in each chunk of the filter for every column (see the RecordBatch overload, which takes advantage of this optimization).
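A rough sketch of the optimization being described: count the set bits of each filter chunk once and reuse the counts for every column. It assumes the filter chunks are boolean arrays, uses arrow::internal::CountSetBits from util/bit_util.h, and ignores null handling for brevity; the function name is illustrative.

#include <cstdint>
#include <vector>

#include "arrow/array.h"          // arrow::BooleanArray
#include "arrow/table.h"          // arrow::ChunkedArray
#include "arrow/util/bit_util.h"  // arrow::internal::CountSetBits

std::vector<int64_t> FilterChunkCounts(const arrow::ChunkedArray& filter) {
  std::vector<int64_t> counts;
  counts.reserve(static_cast<size_t>(filter.num_chunks()));
  for (const auto& chunk : filter.chunks()) {
    const auto& boolean_chunk = static_cast<const arrow::BooleanArray&>(*chunk);
    // One pass over the filter bitmap per chunk, reusable for all columns.
    counts.push_back(arrow::internal::CountSetBits(
        boolean_chunk.values()->data(), boolean_chunk.offset(), boolean_chunk.length()));
  }
  return counts;
}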

nealrichardson (Member Author):

Another missed (deferred) optimization would be to parallelize the filtering across the columns rather than iterating in serial.

I'll take a look at the one you reference and see if I can translate it for tables; otherwise I'll make follow-up JIRAs.

bkietz (Member):

I don't think it's necessary to pre-emptively add every possible permutation here. I added the RecordBatch overload because I specifically planned to use it in arrow::dataset::. Although it saves a few lines of code which would otherwise be necessary to box/unbox from compute::Datum, I think most consumers should rely on the Datum overload.

nealrichardson (Member Author):

It's not preemptive, there are different implementations for each signature. Maybe you can show me what you mean; I haven't worked with Datum objects.
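For reference, boxing and unboxing through compute::Datum looks roughly like the sketch below. The single Datum-based Filter overload and the helper name are assumed shapes used to illustrate the pattern, not code from this patch.

#include <memory>

#include "arrow/compute/context.h"         // arrow::compute::FunctionContext
#include "arrow/compute/kernel.h"          // arrow::compute::Datum
#include "arrow/compute/kernels/filter.h"
#include "arrow/record_batch.h"
#include "arrow/status.h"

using arrow::compute::Datum;
using arrow::compute::FunctionContext;

// Assumes a Datum-based overload with roughly this shape exists:
//   Status Filter(FunctionContext*, const Datum& values, const Datum& filter, Datum* out);
arrow::Status FilterBatchSketch(FunctionContext* ctx,
                                const std::shared_ptr<arrow::RecordBatch>& batch,
                                const std::shared_ptr<arrow::Array>& filter,
                                std::shared_ptr<arrow::RecordBatch>* out) {
  Datum result;
  // Datum has implicit constructors for Array, ChunkedArray, RecordBatch, and
  // Table shared_ptrs, so boxing is just passing the values through.
  ARROW_RETURN_NOT_OK(arrow::compute::Filter(ctx, Datum(batch), Datum(filter), &result));
  *out = result.record_batch();  // unbox; result would have kind RECORD_BATCH
  return arrow::Status::OK();
}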

bkietz (Member):

Instead of concatenating here, you could flatten chunks from current_chunk directly into new_chunks:

ArrayVector new_chunks;
for (const auto& indices_chunk : indices.chunks()) {
  std::shared_ptr<ChunkedArray> taken;
  RETURN_NOT_OK(Take(ctx, values, *indices_chunk, options, &taken));
  // Append the chunks produced for this indices chunk directly to the output
  std::move(taken->chunks().begin(), taken->chunks().end(),
            std::back_inserter(new_chunks));
}

bkietz (Member):

(of course it's a moot point until Take(ChunkedArray values, Array indices) can also avoid concatenation)

nealrichardson (Member Author):

The point of this concatenate was that the resulting chunks should correspond to the chunks defined by indices, as @jorisvandenbossche suggested. This, of course, gets us back to the original discussion of what chunks are for, whether they are purely an internal implementation detail or something that users should govern, what "optimal" chunking is, etc.

bkietz (Member):

I think optimal chunking for output from a kernel is whatever allows the consumer the greatest control over allocation and other performance overhead. Based on that, concatenation should be kept to a minimum, since it generates new allocations instead of cheaply slicing existing ones. As a secondary consideration, the chunked array should have as few chunks as possible, since large contiguous chunks can be processed more efficiently than lots of short chunks; in particular, an output chunked array should not contain empty chunks.

bkietz commented Oct 21, 2019

One other thing: it's not necessary to use the arrow:: prefix when inside our namespace, so we typically leave it off
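A tiny illustration of that style note; the function below is made up.

#include <memory>

#include "arrow/array.h"
#include "arrow/status.h"

namespace arrow {
namespace compute {

// Inside namespace arrow, Status and Array resolve without the arrow:: prefix.
Status ExampleNoPrefix(std::shared_ptr<Array>* out) {
  *out = nullptr;
  return Status::OK();
}

}  // namespace compute
}  // namespace arrow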

@nealrichardson nealrichardson left a comment

Thanks @bkietz!

@nealrichardson nealrichardson changed the title from "ARROW-6784: [C++][R] Move filter, take, select C++ code from Rcpp to C++ library" to "ARROW-6784: [C++][R] Move filter and take for ChunkedArray, RecordBatch, and Table from Rcpp to C++ library" Oct 28, 2019
@nealrichardson nealrichardson marked this pull request as ready for review October 28, 2019 18:45
@nealrichardson nealrichardson requested a review from bkietz October 28, 2019 20:16
@nealrichardson

@bkietz I've deferred the remaining refactoring to these follow-up issues: ARROW-6959, ARROW-7009, ARROW-7012

@bkietz bkietz left a comment

This looks good. You've got a few flaky CI failures and a conversion warning:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/28439412/job/u8d468vtqmqn62wo#L994
Please fix those declarations, then I think this is ready to land.

@nealrichardson

@bkietz done here 6b62e62 PTAL

@nealrichardson

CI is green except for the macOS Travis job that's broken on master.

@wesm wesm self-requested a review October 30, 2019 21:53

wesm commented Oct 31, 2019

Reviewing this now.

@wesm wesm left a comment

Some minor comments. Let me know if you want to make more changes, but at minimum I think there's some follow-up refactoring to do re: Datum-based APIs.

wesm (Member):

We should create a function that accepts a lambda for chunked evaluation so this logic can be reused in other places. This does not have to happen in this patch.
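One possible shape for such a helper; the name and signature are hypothetical. A ChunkedArray Filter or Take could then pass a lambda capturing the FunctionContext, the filter (or indices), and the options.

#include <memory>
#include <utility>

#include "arrow/status.h"
#include "arrow/table.h"  // arrow::ChunkedArray

template <typename ChunkOp>
arrow::Status MapChunks(const arrow::ChunkedArray& input, ChunkOp&& op,
                        std::shared_ptr<arrow::ChunkedArray>* out) {
  arrow::ArrayVector out_chunks;
  out_chunks.reserve(input.chunks().size());
  for (const auto& chunk : input.chunks()) {
    std::shared_ptr<arrow::Array> result;
    // `op` wraps the per-chunk kernel call (Filter, Take, ...).
    ARROW_RETURN_NOT_OK(op(chunk, &result));
    out_chunks.push_back(std::move(result));
  }
  *out = std::make_shared<arrow::ChunkedArray>(std::move(out_chunks));
  return arrow::Status::OK();
}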

wesm (Member):

Same with this logic. I'm not sure about the concatenation part; it seems like you would want to split larger chunks into smaller pieces, yielding an output that has more chunks than the input

e.g.

array chunks [10, 10, 10, 3]
filter chunks [5, 5, 5, 15, 3]

output chunks [5, 5, 5, 5, 10, 3]
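A small sketch of where those output lengths come from: the output boundaries are the union of both inputs' chunk boundaries, which can be computed by walking the two chunk-length lists in lockstep. The function name is illustrative.

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> MergedChunkLengths(std::vector<int64_t> a, std::vector<int64_t> b) {
  std::vector<int64_t> out;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    const int64_t length = std::min(a[i], b[j]);  // distance to the nearest boundary
    out.push_back(length);
    a[i] -= length;
    b[j] -= length;
    if (a[i] == 0) ++i;  // advance past an exhausted chunk
    if (b[j] == 0) ++j;
  }
  return out;
}

// MergedChunkLengths({10, 10, 10, 3}, {5, 5, 5, 15, 3}) yields {5, 5, 5, 5, 10, 3}.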

wesm (Member):

nit: not fond of condensed variable names like schm

nealrichardson (Member Author):

If I don't do something like this, I collide with the function named schema:

/Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/compute/kernels/filter_test.cc:492:17: error: variable 'schema' declared with deduced type 'auto' cannot appear in its own initializer
  auto schema = schema(fields);
                ^
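For illustration, either renaming the variable (as the patch does) or qualifying the call avoids the shadowing; the function and variable names below are examples only.

#include <memory>
#include <vector>

#include "arrow/type.h"  // arrow::schema(), arrow::Field, arrow::Schema

std::shared_ptr<arrow::Schema> MakeSchemaSketch(
    const std::vector<std::shared_ptr<arrow::Field>>& fields) {
  // A variable named `schema` would shadow the schema() factory and trigger
  // the error above; a distinct name (or a qualified ::arrow::schema call) works.
  auto table_schema = ::arrow::schema(fields);
  return table_schema;
}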

wesm (Member):

This is wasteful because values is going to be concatenated over and over for each chunk in indices. Can you add a note here that this is bad and open a follow-up JIRA?
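A sketch of the interim mitigation: concatenate values once, outside the loop over indices chunks, instead of once per chunk. The helper name is illustrative; the real fix (avoiding concatenation altogether) is left to the follow-up JIRA mentioned in the reply below.

#include <memory>

#include "arrow/array/concatenate.h"  // arrow::Concatenate
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"              // arrow::ChunkedArray

arrow::Status ConcatenateValuesOnce(const arrow::ChunkedArray& values,
                                    arrow::MemoryPool* pool,
                                    std::shared_ptr<arrow::Array>* flat_values) {
  // Done a single time; each indices chunk can then Take from *flat_values.
  return arrow::Concatenate(values.chunks(), pool, flat_values);
}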

nealrichardson (Member Author):

I'll add the note. I think this falls under https://issues.apache.org/jira/browse/ARROW-7012.

wesm (Member):

Frankly, I would prefer a Take(FunctionContext*, const Datum&, const Datum&, ...)-based API to this combinatorial explosion of functions. If this is not done in this PR, can you mark these functions as experimental so that we can change things without needing a deprecation cycle?

nealrichardson (Member Author):

Will mark as experimental. See also https://issues.apache.org/jira/browse/ARROW-6959 about Datum policy.

wesm (Member):

Per comments on Take below, I think that using Datum and having fewer public APIs would be better. There are implicit ctors for Datum to make usage easier.

nealrichardson (Member Author):

I briefly looked into refactoring to use Datum but it didn't seem like a good use of my time right now to figure it out. https://issues.apache.org/jira/browse/ARROW-7009 is for someone else to pick that up.

wesm (Member):

No problem

@nealrichardson

Rebased, will merge when green unless there's objection.

@codecov-io

Codecov Report

Merging #5686 into master will increase coverage by 0.56%.
The diff coverage is 97.94%.


@@            Coverage Diff             @@
##           master    #5686      +/-   ##
==========================================
+ Coverage   88.99%   89.56%   +0.56%     
==========================================
  Files        1006      814     -192     
  Lines      137246   121983   -15263     
  Branches     1501        0    -1501     
==========================================
- Hits       122142   109252   -12890     
+ Misses      14739    12731    -2008     
+ Partials      365        0     -365
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/filter.h 66.66% <ø> (ø) ⬆️
cpp/src/arrow/compute/kernels/take.h 75% <ø> (ø) ⬆️
cpp/src/arrow/testing/gtest_util.h 97.36% <ø> (ø) ⬆️
r/R/table.R 95.65% <100%> (+0.19%) ⬆️
cpp/src/arrow/compute/kernels/filter.cc 99.13% <100%> (+0.66%) ⬆️
cpp/src/arrow/compute/kernels/filter_test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/testing/gtest_util.cc 62% <100%> (+3.3%) ⬆️
r/R/chunked-array.R 100% <100%> (+2.7%) ⬆️
r/R/arrowExports.R 74.68% <100%> (+0.23%) ⬆️
r/R/array.R 88.46% <100%> (+0.3%) ⬆️
... and 199 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e73793e...21dbd26.
