Skip to content

Conversation

@lidavidm
Copy link
Member

This implements nested field refs in C++ only, using a SmallVector to hold the FieldPath. This only lets us extract from a struct (I'm not so sure it makes sense for other types?).

The JIRA also requests being able to extract a field as an expression. I think this could be done by implementing a small kernel that we could call. (Or otherwise I think we'd have to add a new case to the Expression variant, which maybe isn't a big deal.) If that sounds reasonable it can be added here.

A microbenchmark was added to see if this impacts the common case of a non-nested field ref. On my local machine it does not appear to (~10-20ns difference).

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Copy link
Member Author

------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
Before:
BindAndEvaluate/simple_array         287 ns          287 ns      2384362
BindAndEvaluate/simple_scalar        283 ns          283 ns      2459757
After:
BindAndEvaluate/simple_array         285 ns          285 ns      2470618
BindAndEvaluate/simple_scalar        272 ns          272 ns      2574147
BindAndEvaluate/nested_array        1022 ns         1021 ns       685364
BindAndEvaluate/nested_scalar        632 ns          631 ns      1128511

@lidavidm
Copy link
Member Author

Note that nested_array is particularly bad (extracting from a StructArray) because we may have to allocate a new validity bitmap for the child, and we may have to do so multiple times (once per nesting level).

@westonpace
Copy link
Member

This only lets us extract from a struct (I'm not so sure it makes sense for other types?).

The compute IR currently allows references to be:

  • MapKey (similar to struct, only valid if the key of the map is string I think).
  • StructField (covered here)
  • ArraySubscript (I'm pretty sure this means you have a ListArray and you want the ith element of each list)
  • ArraySlice (Same as above, operates on list array, seems like overkill to me and should probably just be a function call at this point)

Maybe these don't need to be handled in the same way as nested references (e.g. maybe we capture an ArraySubscript reference in some other structure) but it seems we could cover the MapKey case easily enough at least.

@lidavidm
Copy link
Member Author

Ah, thanks for the reference.

For ArraySubscript: is that row-wise indexing or column-wise? It sounds like ArraySubscript is row-wise (you have a ListArray and want the ith row). And ArraySlice is ARROW-12669 (the list_element kernel).

MapKey: specifically, this sounds like extracting the value of a particular key for each slot of a MapArray? I think this could be handled, but doesn't quite fit with FieldRef here, and might also make sense as a kernel instead. Indeed there's this check here:

arrow/cpp/src/arrow/type.cc

Lines 1059 to 1064 in c3b7c96

Result<std::shared_ptr<ArrayData>> FieldPath::Get(const ArrayData& data) const {
if (data.type->id() != Type::STRUCT) {
return Status::NotImplemented("Get child data of non-struct array");
}
return FieldPathGetImpl::Get(this, data.child_data);
}

@westonpace
Copy link
Member

CC @cpcloud @bkietz thoughts? Thinking more on this I agree with you. We should probably prefer special functions and not try and overload FieldPath. The whole point of having references in the plan instead of functions/kernels is to allow for more intelligent plan transformations. By this point (FieldPath) we've already pretty much passed the "plan" part and so I don't see any disadvantage to using special functions for any of these references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a std::copy(std::begin(path.indices()), std::end(path.indices()), bound.indices) would work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems SmallVector doesn't implement iterator_traits fully (I also had issues with trying to use insert() that I should try and debug)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:378:46: error: no type named 'value_type' in 'std::iterator_traits<arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >'
      typedef typename iterator_traits<_OI>::value_type _ValueTypeO;
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:422:23: note: in instantiation of function template specialization 'std::__copy_move_a<false, const int *, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
      return _OI(std::__copy_move_a<_IsMove>(std::__niter_base(__first),
                      ^
/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/stl_algobase.h:454:20: note: in instantiation of function template specialization 'std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<const int *, std::vector<int, std::allocator<int> > >, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
      return (std::__copy_move_a2<__is_move_iterator<_II>::__value>
                   ^
/home/lidavidm/Code/upstream/arrow-13987/cpp/src/arrow/compute/exec/expression.cc:401:10: note: in instantiation of function template specialization 'std::copy<__gnu_cxx::__normal_iterator<const int *, std::vector<int, std::allocator<int> > >, arrow::internal::StaticVectorImpl<int, 2, arrow::internal::SmallVectorStorage<int, 2> > >' requested here
    std::copy(path.indices().begin(), path.indices().end(), bound.indices);
         ^
/home/lidavidm/Code/upstream/arrow-13987/cpp/src/arrow/compute/exec/expression.cc:424:10: note: in instantiation of function template specialization 'arrow::compute::(anonymous namespace)::BindImpl<arrow::DataType>' requested here
  return BindImpl(*this, *in.type, in.shape, exec_context);
         ^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah never mind that, I was being dumb about iterator vs container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find Flatten a bit confusing, but I understand there's precedent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could call it MakeFlattenedChild or GetFlattenedChild or something if that helps? I agree Flatten isn't the best name when it comes to a single child array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this and fixed the copy above to use std::copy.

@lidavidm
Copy link
Member Author

lidavidm commented Nov 2, 2021

Since we've been talking about unions, we could perhaps include union support here as well, analogously to struct support? That way you could grab one of the variants in the union, with non-matching rows masked off. (That's starting to be enough logic that I might prefer a kernel instead of inlining it here, though; but honestly, I think I'd prefer moving all the logic into kernels and just calling it from expression.cc)

@cpcloud
Copy link
Contributor

cpcloud commented Nov 2, 2021

I think making these kernels would probably be a good idea, if that allows us to move some of the logic into smaller functions.

Unions seem like a great follow up, and well-contained enough for its own PR. I don't feel strongly about that though and will happily review it wherever you decide to do it!

@nealrichardson
Copy link
Member

Can this be merged? Sounds like there are followup issues to be made as well (bindings in R, Python; other type support; making it a kernel)--please link them when they're made.

@lidavidm
Copy link
Member Author

lidavidm commented Nov 5, 2021

I was (slowly) working on union support but I'll split that out and get the logic here into a kernel when I get a chance.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two minor nits:

@lidavidm
Copy link
Member Author

lidavidm commented Nov 5, 2021

Follow ups: ARROW-14615 to add union support and refactor into a kernel; ARROW-11259 for Python bindings; ARROW-13858 for R bindings.

@ursabot
Copy link

ursabot commented Nov 8, 2021

Benchmark runs are scheduled for baseline = 41000a1 and contender = 8ebc505. 8ebc505 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@jorisvandenbossche
Copy link
Member

As far as I understand, this PR does not yet enable you to use those nested field refs in dataset scans, is that correct? And is that planned as a follow-up? (do we have a JIRA for that?)

@lidavidm
Copy link
Member Author

This should now "work" on the C++ side, though I realize I didn't actually add tests for that.

For Python and R, we have the followups above (ARROW-11259 and ARROW-13858 respectively).

@lidavidm
Copy link
Member Author

I filed ARROW-14658.

@jorisvandenbossche
Copy link
Member

We have this line in the scanner code in C++:

return Status::NotImplemented("Nested field references in scans.");

which is what I ran into when naively trying to call this in Python (after adding a python binding for creating a nested FieldRef).

@lidavidm
Copy link
Member Author

Ah, sorry, I should've remembered to test that. I'll take a look at this today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants