Conversation

@nealrichardson nealrichardson commented Oct 17, 2019

  • RecordBatch Filter
  • RecordBatch Take
  • ChunkedArray Filter
  • ChunkedArray Take
  • Table Filter
  • Table Take
  • Tests for ChunkedArray/Table Filter
  • lint etc.

@bkietz bkietz left a comment

This is a good start. I think you can reduce the number of public-facing overloads in favor of the Datum overloads.

bkietz (Member):

Instead of using Concatenate here, I think it'd be better to use std::vector<ArrayVector> RechunkArraysConsistently(const std::vector<ArrayVector>&); (defined in array.h). After that the chunks will be of equal length, suitable for filtering.
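A minimal sketch of the suggested approach, assuming RechunkArraysConsistently is the arrow::internal helper declared in array.h; the function name and signature below are made up for illustration. After the call, corresponding entries of the two chunk vectors have equal length, so each pair can be handed straight to the per-chunk filter kernel.

#include <utility>
#include <vector>

#include "arrow/array.h"  // arrow::internal::RechunkArraysConsistently
#include "arrow/table.h"  // arrow::ChunkedArray

void AlignChunksSketch(const arrow::ChunkedArray& column,
                       const arrow::ChunkedArray& filter,
                       arrow::ArrayVector* column_chunks,
                       arrow::ArrayVector* filter_chunks) {
  // Slice both inputs on the union of their chunk boundaries; this produces
  // zero-copy slices rather than concatenating values into a new allocation.
  std::vector<arrow::ArrayVector> rechunked =
      arrow::internal::RechunkArraysConsistently({column.chunks(), filter.chunks()});
  *column_chunks = std::move(rechunked[0]);
  *filter_chunks = std::move(rechunked[1]);
}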

nealrichardson (Member Author):

Thanks, I'll look into that.

bkietz (Member):

This can be addressed later, but there's an unfortunate missed optimization here: since we're reusing the same filter for each column, we don't need to recount the set bits in each chunk of the filter for every column (see the RecordBatch overload, which takes advantage of this optimization).
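A rough sketch of the optimization being described: count the set bits of each filter chunk once and reuse the counts for every column. It assumes the filter chunks are boolean arrays, uses arrow::internal::CountSetBits from util/bit_util.h, and ignores null handling for brevity; the function name is illustrative.

#include <cstdint>
#include <vector>

#include "arrow/array.h"          // arrow::BooleanArray
#include "arrow/table.h"          // arrow::ChunkedArray
#include "arrow/util/bit_util.h"  // arrow::internal::CountSetBits

std::vector<int64_t> FilterChunkCounts(const arrow::ChunkedArray& filter) {
  std::vector<int64_t> counts;
  counts.reserve(static_cast<size_t>(filter.num_chunks()));
  for (const auto& chunk : filter.chunks()) {
    const auto& boolean_chunk = static_cast<const arrow::BooleanArray&>(*chunk);
    // One pass over the filter bitmap per chunk, reusable for all columns.
    counts.push_back(arrow::internal::CountSetBits(
        boolean_chunk.values()->data(), boolean_chunk.offset(), boolean_chunk.length()));
  }
  return counts;
}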

nealrichardson (Member Author):

Another missed (deferred) optimization would be to parallelize the filtering across the columns rather than iterating in serial.

I'll take a look at the one you reference and see if I can translate it for tables; otherwise I'll make follow-up JIRAs.

bkietz (Member):

I don't think it's necessary to pre-emptively add every possible permutation here. I added the RecordBatch overload because I specifically planned to use it in arrow::dataset::. Although it saves a few lines of code which would otherwise be necessary to box/unbox from compute::Datum, I think most consumers should rely on the Datum overload.

nealrichardson (Member Author):

It's not preemptive, there are different implementations for each signature. Maybe you can show me what you mean; I haven't worked with Datum objects.
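For reference, boxing and unboxing through compute::Datum looks roughly like the sketch below. The single Datum-based Filter overload and the helper name are assumed shapes used to illustrate the pattern, not code from this patch.

#include <memory>

#include "arrow/compute/context.h"         // arrow::compute::FunctionContext
#include "arrow/compute/kernel.h"          // arrow::compute::Datum
#include "arrow/compute/kernels/filter.h"
#include "arrow/record_batch.h"
#include "arrow/status.h"

using arrow::compute::Datum;
using arrow::compute::FunctionContext;

// Assumes a Datum-based overload with roughly this shape exists:
//   Status Filter(FunctionContext*, const Datum& values, const Datum& filter, Datum* out);
arrow::Status FilterBatchSketch(FunctionContext* ctx,
                                const std::shared_ptr<arrow::RecordBatch>& batch,
                                const std::shared_ptr<arrow::Array>& filter,
                                std::shared_ptr<arrow::RecordBatch>* out) {
  Datum result;
  // Datum has implicit constructors for Array, ChunkedArray, RecordBatch, and
  // Table shared_ptrs, so boxing is just passing the values through.
  ARROW_RETURN_NOT_OK(arrow::compute::Filter(ctx, Datum(batch), Datum(filter), &result));
  *out = result.record_batch();  // unbox; result would have kind RECORD_BATCH
  return arrow::Status::OK();
}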

bkietz (Member):

Instead of concatenating here, you could flatten chunks from current_chunk directly into new_chunks:

ArrayVector new_chunks;
for (const auto& indices_chunk : indices.chunks()) {
  std::shared_ptr<ChunkedArray> taken;
  RETURN_NOT_OK(Take(ctx, values, *indices_chunk, options, &taken));
  // Append the chunks produced for this indices chunk directly to the output
  std::move(taken->chunks().begin(), taken->chunks().end(),
            std::back_inserter(new_chunks));
}

bkietz (Member):

(of course it's a moot point until Take(ChunkedArray values, Array indices) can also avoid concatenation)

nealrichardson (Member Author):

The point of this concatenate was that the resulting chunks should correspond to the chunks defined by indices, as @jorisvandenbossche suggested. This, of course, gets us back to the original discussion of what chunks are for, whether they are purely an internal implementation detail or something that users should govern, what "optimal" chunking is, etc.

bkietz (Member):

I think optimal chunking for output from a kernel is whatever allows the consumer the greatest control over allocation and other performance overhead. Based on that, concatenation should be kept to a minimum, since it generates new allocations instead of cheaply slicing existing ones. As a secondary consideration, the chunked array should have as few chunks as possible, since large contiguous chunks can be processed more efficiently than lots of short chunks; in particular, an output chunked array should not contain empty chunks.

bkietz commented Oct 21, 2019

One other thing: it's not necessary to use the arrow:: prefix when inside our namespace, so we typically leave it off
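A tiny illustration of that style note; the function below is made up.

#include <memory>

#include "arrow/array.h"
#include "arrow/status.h"

namespace arrow {
namespace compute {

// Inside namespace arrow, Status and Array resolve without the arrow:: prefix.
Status ExampleNoPrefix(std::shared_ptr<Array>* out) {
  *out = nullptr;
  return Status::OK();
}

}  // namespace compute
}  // namespace arrow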

@nealrichardson nealrichardson left a comment

Thanks @bkietz!

@nealrichardson nealrichardson changed the title from "ARROW-6784: [C++][R] Move filter, take, select C++ code from Rcpp to C++ library" to "ARROW-6784: [C++][R] Move filter and take for ChunkedArray, RecordBatch, and Table from Rcpp to C++ library" Oct 28, 2019
@nealrichardson nealrichardson marked this pull request as ready for review October 28, 2019 18:45
@nealrichardson nealrichardson requested a review from bkietz October 28, 2019 20:16
@nealrichardson

@bkietz I've deferred the remaining refactoring to these follow-up issues: ARROW-6959, ARROW-7009, ARROW-7012

@bkietz bkietz left a comment

This looks good. You've got a few flaky CI failures and a conversion warning:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/28439412/job/u8d468vtqmqn62wo#L994
Please fix those declarations, then I think this is ready to land.

@nealrichardson

@bkietz done here 6b62e62 PTAL

@nealrichardson

CI is green except for the macOS Travis job that's broken on master.

@wesm wesm self-requested a review October 30, 2019 21:53

wesm commented Oct 31, 2019

Reviewing this now.

@wesm wesm left a comment

Some minor comments. Let me know if you want to make more changes, but at minimum I think there's some follow-up refactoring to do re: Datum-based APIs.

wesm (Member):

We should create a function that accepts a lambda for chunked evaluation so this logic can be reused in other places. This does not have to happen in this patch.
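One possible shape for such a helper; the name and signature are hypothetical. A ChunkedArray Filter or Take could then pass a lambda capturing the FunctionContext, the filter (or indices), and the options.

#include <memory>
#include <utility>

#include "arrow/status.h"
#include "arrow/table.h"  // arrow::ChunkedArray

template <typename ChunkOp>
arrow::Status MapChunks(const arrow::ChunkedArray& input, ChunkOp&& op,
                        std::shared_ptr<arrow::ChunkedArray>* out) {
  arrow::ArrayVector out_chunks;
  out_chunks.reserve(input.chunks().size());
  for (const auto& chunk : input.chunks()) {
    std::shared_ptr<arrow::Array> result;
    // `op` wraps the per-chunk kernel call (Filter, Take, ...).
    ARROW_RETURN_NOT_OK(op(chunk, &result));
    out_chunks.push_back(std::move(result));
  }
  *out = std::make_shared<arrow::ChunkedArray>(std::move(out_chunks));
  return arrow::Status::OK();
}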

wesm (Member):

Same with this logic. I'm not sure about the concatenation part; it seems like you would want to split larger chunks into smaller pieces, yielding an output that has more chunks than the input

e.g.

array chunks [10, 10, 10, 3]
filter chunks [5, 5, 5, 15, 3]

output chunks [5, 5, 5, 5, 10, 3]
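A small sketch of where those output lengths come from: the output boundaries are the union of both inputs' chunk boundaries, which can be computed by walking the two chunk-length lists in lockstep. The function name is illustrative.

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int64_t> MergedChunkLengths(std::vector<int64_t> a, std::vector<int64_t> b) {
  std::vector<int64_t> out;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    const int64_t length = std::min(a[i], b[j]);  // distance to the nearest boundary
    out.push_back(length);
    a[i] -= length;
    b[j] -= length;
    if (a[i] == 0) ++i;  // advance past an exhausted chunk
    if (b[j] == 0) ++j;
  }
  return out;
}

// MergedChunkLengths({10, 10, 10, 3}, {5, 5, 5, 15, 3}) yields {5, 5, 5, 5, 10, 3}.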

wesm (Member):

nit: not fond of condensed variable names like schm

nealrichardson (Member Author):

If I don't do something like this, I collide with the function named schema:

/Users/enpiar/Documents/ursa/arrow/cpp/src/arrow/compute/kernels/filter_test.cc:492:17: error: variable 'schema' declared with deduced type 'auto' cannot appear in its own initializer
  auto schema = schema(fields);
                ^
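For illustration, either renaming the variable (as the patch does) or qualifying the call avoids the shadowing; the function and variable names below are examples only.

#include <memory>
#include <vector>

#include "arrow/type.h"  // arrow::schema(), arrow::Field, arrow::Schema

std::shared_ptr<arrow::Schema> MakeSchemaSketch(
    const std::vector<std::shared_ptr<arrow::Field>>& fields) {
  // A variable named `schema` would shadow the schema() factory and trigger
  // the error above; a distinct name (or a qualified ::arrow::schema call) works.
  auto table_schema = ::arrow::schema(fields);
  return table_schema;
}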

wesm (Member):

This is wasteful because values is going to be concatenated over and over for each chunk in indices. Can you add a note here that this is bad and open a follow-up JIRA?
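A sketch of the interim mitigation: concatenate values once, outside the loop over indices chunks, instead of once per chunk. The helper name is illustrative; the real fix (avoiding concatenation altogether) is left to the follow-up JIRA mentioned in the reply below.

#include <memory>

#include "arrow/array/concatenate.h"  // arrow::Concatenate
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"              // arrow::ChunkedArray

arrow::Status ConcatenateValuesOnce(const arrow::ChunkedArray& values,
                                    arrow::MemoryPool* pool,
                                    std::shared_ptr<arrow::Array>* flat_values) {
  // Done a single time; each indices chunk can then Take from *flat_values.
  return arrow::Concatenate(values.chunks(), pool, flat_values);
}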

nealrichardson (Member Author):

I'll add the note. I think this falls under https://issues.apache.org/jira/browse/ARROW-7012.

wesm (Member):

Frankly, I would prefer a Take(FunctionContext*, const Datum&, const Datum&, ...)-based API to this combinatorial explosion of functions. If this is not done in this PR, can you mark these functions as experimental so that we can change things without needing a deprecation cycle?

nealrichardson (Member Author):

Will mark as experimental. See also https://issues.apache.org/jira/browse/ARROW-6959 about Datum policy.

wesm (Member):

Per comments on Take below, I think that using Datum and having fewer public APIs would be better. There are implicit ctors for Datum to make usage easier.

nealrichardson (Member Author):

I briefly looked into refactoring to use Datum but it didn't seem like a good use of my time right now to figure it out. https://issues.apache.org/jira/browse/ARROW-7009 is for someone else to pick that up.

wesm (Member):

No problem

@nealrichardson

Rebased, will merge when green unless there's objection.

@codecov-io

Codecov Report

Merging #5686 into master will increase coverage by 0.56%.
The diff coverage is 97.94%.


@@            Coverage Diff             @@
##           master    #5686      +/-   ##
==========================================
+ Coverage   88.99%   89.56%   +0.56%     
==========================================
  Files        1006      814     -192     
  Lines      137246   121983   -15263     
  Branches     1501        0    -1501     
==========================================
- Hits       122142   109252   -12890     
+ Misses      14739    12731    -2008     
+ Partials      365        0     -365
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/filter.h 66.66% <ø> (ø) ⬆️
cpp/src/arrow/compute/kernels/take.h 75% <ø> (ø) ⬆️
cpp/src/arrow/testing/gtest_util.h 97.36% <ø> (ø) ⬆️
r/R/table.R 95.65% <100%> (+0.19%) ⬆️
cpp/src/arrow/compute/kernels/filter.cc 99.13% <100%> (+0.66%) ⬆️
cpp/src/arrow/compute/kernels/filter_test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/testing/gtest_util.cc 62% <100%> (+3.3%) ⬆️
r/R/chunked-array.R 100% <100%> (+2.7%) ⬆️
r/R/arrowExports.R 74.68% <100%> (+0.23%) ⬆️
r/R/array.R 88.46% <100%> (+0.3%) ⬆️
... and 199 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e73793e...21dbd26.
