-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13987: [C++] Support nested field refs #11466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
|
Note that |
The compute IR currently allows references to be:
Maybe these don't need to be handled in the same way as nested references (e.g. maybe we capture an ArraySubscript reference in some other structure) but it seems we could cover the MapKey case easily enough at least. |
|
Ah, thanks for the reference. For ArraySubscript: is that row-wise indexing or column-wise? It sounds like ArraySubscript is row-wise (you have a ListArray and want the ith row). And ArraySlice is ARROW-12669 (the list_element kernel). MapKey: specifically, this sounds like extracting the value of a particular key for each slot of a MapArray? I think this could be handled, but doesn't quite fit with FieldRef here, and might also make sense as a kernel instead. Indeed there's this check here: Lines 1059 to 1064 in c3b7c96
|
|
CC @cpcloud @bkietz thoughts? Thinking more on this I agree with you. We should probably prefer special functions and not try and overload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a std::copy(std::begin(path.indices()), std::end(path.indices()), bound.indices) would work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems SmallVector doesn't implement iterator_traits fully (I also had issues with trying to use insert() that I should try and debug)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:378:46: error: no type named 'value_type' in 'std::iterator_traits<arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >'
typedef typename iterator_traits<_OI>::value_type _ValueTypeO;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:422:23: note: in instantiation of function template specialization 'std::__copy_move_a<false, const int *, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
return _OI(std::__copy_move_a<_IsMove>(std::__niter_base(__first),
^
/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:454:20: note: in instantiation of function template specialization 'std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<const int *, std::vector<int, std::allocator<int> > >, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
return (std::__copy_move_a2<__is_move_iterator<_II>::__value>
^
/home/lidavidm/Code/upstream/arrow-13987/cpp/src/arrow/compute/exec/expression.cc:401:10: note: in instantiation of function template specialization 'std::copy<__gnu_cxx::__normal_iterator<const int *, std::vector<int, std::allocator<int> > >, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
std::copy(path.indices().begin(), path.indices().end(), bound.indices);
^
/home/lidavidm/Code/upstream/arrow-13987/cpp/src/arrow/compute/exec/expression.cc:424:10: note: in instantiation of function template specialization 'arrow::compute::(anonymous namespace)::BindImpl<arrow::DataType>' requested here
return BindImpl(*this, *in.type, in.shape, exec_context);
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah never mind that, I was being dumb about iterator vs container.
cpp/src/arrow/array/array_nested.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find Flatten a bit confusing, but I understand there's precedent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could call it MakeFlattenedChild or GetFlattenedChild or something if that helps? I agree Flatten isn't the best name when it comes to a single child array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this and fixed the copy above to use std::copy.
|
Since we've been talking about unions, we could perhaps include union support here as well, analogously to struct support? That way you could grab one of the variants in the union, with non-matching rows masked off. (That's starting to be enough logic that I might prefer a kernel instead of inlining it here, though; but honestly, I think I'd prefer moving all the logic into kernels and just calling it from expression.cc) |
|
I think making these kernels would probably be a good idea, if that allows us to move some of the logic into smaller functions. Unions seem like a great follow up, and well-contained enough for its own PR. I don't feel strongly about that though and will happily review it wherever you decide to do it! |
|
Can this be merged? Sounds like there are followup issues to be made as well (bindings in R, Python; other type support; making it a kernel)--please link them when they're made. |
|
I was (slowly) working on union support but I'll split that out and get the logic here into a kernel when I get a chance. |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two minor nits:
|
Follow ups: ARROW-14615 to add union support and refactor into a kernel; ARROW-11259 for Python bindings; ARROW-13858 for R bindings. |
|
Benchmark runs are scheduled for baseline = 41000a1 and contender = 8ebc505. 8ebc505 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
As far as I understand, this PR does not yet enable you to use those nested field refs in dataset scans, is that correct? And is that planned as a follow-up? (do we have a JIRA for that?) |
|
This should now "work" on the C++ side, though I realize I didn't actually add tests for that. For Python and R, we have the followups above (ARROW-11259 and ARROW-13858 respectively). |
|
I filed ARROW-14658. |
|
We have this line in the scanner code in C++:
which is what I ran into when naively trying to call this in Python (after adding a python binding for creating a nested FieldRef). |
|
Ah, sorry, I should've remembered to test that. I'll take a look at this today. |
This implements nested field refs in C++ only, using a SmallVector to hold the FieldPath. This only lets us extract from a struct (I'm not so sure it makes sense for other types?).
The JIRA also requests being able to extract a field as an expression. I think this could be done by implementing a small kernel that we could call. (Or otherwise I think we'd have to add a new case to the Expression variant, which maybe isn't a big deal.) If that sounds reasonable it can be added here.
A microbenchmark was added to see if this impacts the common case of a non-nested field ref. On my local machine it does not appear to (~10-20ns difference).