ARROW-6784: [C++][R] Move filter and take for ChunkedArray, RecordBatch, and Table from Rcpp to C++ library #5686
Conversation
bkietz left a comment
This is a good start. I think you can reduce the number of public facing overloads in favor of the Datum overloads.
Instead of using Concatenate here, I think it'd be better to use std::vector<ArrayVector> RechunkArraysConsistently(const std::vector<ArrayVector>&); (defined in array.h). After that the the chunks will be equal length, suitable for filtering.
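For illustration only (not part of the patch), filtering after such a rechunk might look roughly like the following; `ctx` and the Array-level Filter overload (and its exact signature) are assumptions taken from the surrounding kernel code:
// Hedged sketch, not the actual implementation.
Status FilterChunked(FunctionContext* ctx, const ChunkedArray& values,
                     const ChunkedArray& filter,
                     std::shared_ptr<ChunkedArray>* out) {
  // Realign both inputs so corresponding chunks have equal length.
  std::vector<ArrayVector> rechunked =
      RechunkArraysConsistently({values.chunks(), filter.chunks()});
  const ArrayVector& value_chunks = rechunked[0];
  const ArrayVector& filter_chunks = rechunked[1];

  ArrayVector out_chunks(value_chunks.size());
  for (size_t i = 0; i < value_chunks.size(); ++i) {
    // With aligned chunk boundaries, the Array -> Array filter kernel applies
    // directly to each pair of chunks.
    RETURN_NOT_OK(Filter(ctx, *value_chunks[i], *filter_chunks[i], &out_chunks[i]));
  }
  *out = std::make_shared<ChunkedArray>(std::move(out_chunks), values.type());
  return Status::OK();
}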
Thanks, I'll look into that.
This can be addressed later, but: there's an unfortunate missed optimization here. Since we're reusing the same filter for each column we don't need to recount the set bits in each chunk of the filter for every column (see the record batch overload which takes advantage of this optimization).
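A hedged sketch of the optimization being described: count the selected slots per filter chunk once and reuse those counts for every column. The integration point is assumed, and null filter slots are ignored for brevity:
// Illustrative only: precompute the set-bit count of each filter chunk so the
// counts can be shared when sizing the output for every column. Uses
// arrow::internal::CountSetBits and internal::checked_cast.
std::vector<int64_t> selected_counts(filter.num_chunks());
for (int i = 0; i < filter.num_chunks(); ++i) {
  const auto& chunk =
      internal::checked_cast<const BooleanArray&>(*filter.chunk(i));
  selected_counts[i] = internal::CountSetBits(chunk.values()->data(),
                                              chunk.offset(), chunk.length());
}
// selected_counts can now be reused for all columns instead of being
// recomputed per column.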
Another missed (deferred) optimization would be to parallelize the filtering across the columns rather than iterating in serial.
I'll take a look at the one you reference and see if I can translate it for tables; otherwise I'll make followup Jiras.
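For reference, one shape the column-parallel version could take; this is purely a sketch, and both the availability of arrow::internal::ParallelFor (util/parallel.h) in this layer and the ChunkedArray Filter signature shown are assumptions:
// Illustrative sketch only: filter each column as an independent task instead
// of iterating serially.
std::vector<std::shared_ptr<ChunkedArray>> out_columns(table.num_columns());
RETURN_NOT_OK(::arrow::internal::ParallelFor(
    table.num_columns(),
    [&](int i) { return Filter(ctx, *table.column(i), filter, &out_columns[i]); }));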
cpp/src/arrow/compute/kernels/take.h (Outdated)
I don't think it's necessary to pre-emptively add every possible permutation here. I added the RecordBatch overload because I specifically planned to use it in arrow::dataset::. Although it saves a few lines of code which would otherwise be necessary to box/unbox from compute::Datum, I think most consumers should rely on the Datum overload.
It's not preemptive; there are different implementations for each signature. Maybe you can show me what you mean; I haven't worked with Datum objects.
Instead of concatenating here, you could flatten chunks from current_chunk directly into new_chunks:
ArrayVector new_chunks;
for (const auto& indices_chunk : indices.chunks()) {
  std::shared_ptr<ChunkedArray> taken;
  RETURN_NOT_OK(Take(ctx, values, *indices_chunk, options, &taken));
  // Move each per-indices-chunk result's chunks straight into the output vector.
  std::move(taken->chunks().begin(), taken->chunks().end(),
            std::back_inserter(new_chunks));
}
(of course it's a moot point until Take(ChunkedArray values, Array indices) can also avoid concatenation)
The point of this concatenate was that the resulting chunks should correspond to the chunks defined by indices, as @jorisvandenbossche suggested. This, of course, gets us back to the original discussion of what chunks are for, whether they are purely an internal implementation detail or something that users should govern, what "optimal" chunking is, etc.
I think optimal chunking for output from a kernel is whatever allows the consumer greatest control over allocation and other performance overhead. Based on this: concatenation should be kept to a minimum since that generates new allocations instead of cheaply slicing existing ones. As a secondary consideration, the chunked array should have as few chunks as possible since large contiguous chunks can be processed more efficiently than lots of short chunks. Based on that: an output chunked array should not contain empty chunks.
One other thing: it's not necessary to use the …
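For illustration, the "no empty chunks" guideline above might look like this when assembling a result; names such as new_chunks, values, and out are assumed from the surrounding kernel code:
// Hedged sketch: drop zero-length chunks while assembling the output.
ArrayVector non_empty;
non_empty.reserve(new_chunks.size());
for (const auto& chunk : new_chunks) {
  if (chunk->length() > 0) {
    non_empty.push_back(chunk);
  }
}
*out = std::make_shared<ChunkedArray>(std::move(non_empty), values.type());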
nealrichardson left a comment
Thanks @bkietz!
Force-pushed 683b73d to ba72253
@bkietz I've deferred the remaining refactoring to these followup issues: ARROW-6959, ARROW-7009, ARROW-7012
bkietz left a comment
This looks good. You've got a few flaky CI failures and a conversion warning:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/28439412/job/u8d468vtqmqn62wo#L994
Please fix those declarations, then I think this is ready to land.
Force-pushed 6c17594 to 6b62e62
CI is green except for the macOS Travis job that's broken on master.
Reviewing this now.
wesm left a comment
Some minor comments. Let me know if you want to make more changes, but at minimum I think there's some follow up refactoring to do re: Datum-based APIs
We should create a function that accepts a lambda for chunked evaluation so this logic can be reused in other places. Does not have to happen in this patch
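Purely as a sketch of the shape such a reusable helper might take (nothing here is in the patch; the name MapChunks and the kernel signature are hypothetical):
// Hypothetical helper: rechunk two chunked inputs consistently, then apply a
// caller-supplied chunk-wise kernel to each aligned pair of chunks.
template <typename ChunkKernel>  // Status(const Array&, const Array&, std::shared_ptr<Array>*)
Status MapChunks(const ChunkedArray& left, const ChunkedArray& right,
                 ChunkKernel&& kernel, std::shared_ptr<ChunkedArray>* out) {
  std::vector<ArrayVector> rechunked =
      RechunkArraysConsistently({left.chunks(), right.chunks()});
  ArrayVector out_chunks(rechunked[0].size());
  for (size_t i = 0; i < out_chunks.size(); ++i) {
    RETURN_NOT_OK(kernel(*rechunked[0][i], *rechunked[1][i], &out_chunks[i]));
  }
  *out = std::make_shared<ChunkedArray>(std::move(out_chunks), left.type());
  return Status::OK();
}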
Same with this logic. I'm not sure about the concatenation part; it seems like you would want to split larger chunks into smaller pieces, yielding an output that has more chunks than the input.
e.g.
array chunks [10, 10, 10, 3]
filter chunks [5, 5, 5, 15, 3]
output chunks [5, 5, 5, 5, 10, 3]
Re: chunking strategy, see https://issues.apache.org/jira/browse/ARROW-7012
nit: not fond of condensed variable names like schm
If I don't do something like this, I collide with the function named schema:
/Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/compute/kernels/filter_test.cc:492:17: error: variable 'schema' declared with deduced type 'auto' cannot appear in its own initializer
auto schema = schema(fields);
^
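For reference, two ways around that collision (illustrative only; the test currently uses the abbreviated name):
auto schm = schema(fields);             // rename the variable, as the patch does
auto schema = ::arrow::schema(fields);  // or qualify the factory call so the
                                        // initializer no longer refers to the
                                        // variable being declared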
This is wasteful because values is going to be concatenated over and over for each chunk in indices. Can you add a note here that this is bad, and open a follow-up JIRA?
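A hedged sketch of the fix being requested (deferred to the follow-up JIRA): hoist the concatenation of values out of the loop over indices chunks. The Concatenate and Array-level Take signatures shown are assumptions:
// Concatenate the values once, up front, instead of once per indices chunk.
std::shared_ptr<Array> values_array;
RETURN_NOT_OK(Concatenate(values.chunks(), ctx->memory_pool(), &values_array));

ArrayVector new_chunks;
for (const auto& indices_chunk : indices.chunks()) {
  std::shared_ptr<Array> taken;
  RETURN_NOT_OK(Take(ctx, *values_array, *indices_chunk, options, &taken));
  new_chunks.push_back(std::move(taken));
}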
I'll add the note. I think this falls under https://issues.apache.org/jira/browse/ARROW-7012.
cpp/src/arrow/compute/kernels/take.h (Outdated)
Frankly I would prefer a Take(FunctionContext*, const Datum&, const Datum&, ...) based API to this combinatorial explosion of functions. If this is not done in this PR, can you mark these functions as experimental so that we can change things without need for a deprecation cycle?
Will mark as experimental. See also https://issues.apache.org/jira/browse/ARROW-6959 about Datum policy.
Per comments on Take below I think that using Datum and having fewer public APIs would be better. There are implicit ctors for Datum to make the usage easier
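Illustrative only (the exact signature is an assumption; see ARROW-7009): with the implicit Datum constructors, a single Datum-based entry point could cover the different input kinds.
Datum out;
RETURN_NOT_OK(Take(ctx, Datum(values), Datum(indices), options, &out));
auto taken = out.chunked_array();  // or out.make_array(), depending on the inputs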
I briefly looked into refactoring to use Datum but it didn't seem like a good use of my time right now to figure it out. https://issues.apache.org/jira/browse/ARROW-7009 is for someone else to pick that up.
No problem
Force-pushed 748140a to 9f37a69
…thout concatenation
Co-Authored-By: Benjamin Kietzman <bengilgit@gmail.com>
Force-pushed 9f37a69 to 21dbd26
Rebased, will merge when green unless there's objection.
Codecov Report
@@            Coverage Diff             @@
##           master    #5686      +/-   ##
==========================================
+ Coverage   88.99%   89.56%    +0.56%
==========================================
  Files        1006      814      -192
  Lines      137246   121983    -15263
  Branches     1501        0     -1501
==========================================
- Hits       122142   109252    -12890
+ Misses      14739    12731     -2008
+ Partials      365        0      -365

Continue to review full report at Codecov.