Skip to content

Conversation

@lidavidm
Copy link
Member

This is a draft of adding more utility methods to FunctionOptions. It's not fully implemented (it needs rebasing + serialization isn't implemented for most options, plus there are various TODOs scattered). But before I proceed further, I wanted to get some feedback.

Some concerns I have:

  • I don't like adding protected methods to a struct, and it's inconsistent with how equality is implemented for other structs (via a visitor or otherwise centralized in a single location). However ARROW-8891 will require that we be able to define kernels - and presumably their options - in a separate shared library, so I don't think we can do much better than this.
  • But for (de)serialization, we'll still need some way to dynamically register the mapping between a type_name and the actual struct, so maybe this is a moot point.
  • I've exposed the fact that serialization uses StructScalars to support Expression - but maybe this is too much to commit to in the API?

@lidavidm
Copy link
Member Author

@bkietz I'd appreciate if you had any feedback on the approach here/if this was in line with what you were thinking.

@github-actions
Copy link

@bkietz
Copy link
Member

bkietz commented Jun 14, 2021

@lidavidm thanks for working on this!

I don't like adding protected methods to a struct, and it's inconsistent with how equality is implemented for other structs (via a visitor or otherwise centralized in a single location). However ARROW-8891 will require that we be able to define kernels - and presumably their options - in a separate shared library, so I don't think we can do much better than this.

I agree that it's unfortunately necessary in this case: FunctionOptions is an extension point (unlike Array or Scalar, where we own the full enumeration of subclasses). In light of that: FunctionOptions should be promoted from struct to class (same for its subclasses).

But for (de)serialization, we'll still need some way to dynamically register the mapping between a type_name and the actual struct, so maybe this is a moot point.

We could add this to FunctionRegistry, I think:

class FunctionOptionsType {
 public:
  virtual std::string name() const = 0;

  virtual bool Compare(const FunctionOptions&, const FunctionOptions&) const = 0;
  virtual std::string Stringify(const FunctionOptions&) const = 0;
  virtual Result<std::unique_ptr<Buffer>> Serialize(const FunctionOptions&) const = 0;
  virtual Result<std::unique_ptr<FunctionOptions>> Deserialize(const Buffer&) const = 0;
};

class FunctionOptions {
 public:
  virtual ~FunctionOptions() = default;
  const FunctionOptionsType* options_type() const { return type_; }

  bool Equals(const FunctionOptions& other) const {
    return type_ = other.type_ && type_->Compare(*this, other);
  }
  // forward other methods to type_ as well

 protected:
  explicit FunctionOptions(const FunctionOptionsType* type) : type_(type) {}
  const FunctionOptionsType* type_;
};

class FunctionRegistry {
 public:
  // Called alongside AddFunction
  // will raise if another options type with the same name exists
  Status AddFunctionOptionsType(const FunctionOptionsType*);
};

class Function {
 public:
  // nullptr if the function does not take options.
  // This is also nice since we can assert correctly typed FunctionOptions
  const FunctionOptionsType* options_type() const;
};

I've exposed the fact that serialization uses StructScalars to support Expression - but maybe this is too much to commit to in the API?

Yes, I'd prefer that implementation detail be kept internal. Please ensure ToScalar/FromScalar can't be accessed without including an _internal header.

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

Can we explain the objectives and design constraints here?

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

Also, if the goal is to serialize function options as a struct scalar using the IPC format, it would be good to check performance, because I fear there may be a lot of overhead doing so.

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

One more thought: we should try to minimize the hand-writing of boilerplate. Is it possible to somehow define a subclass of std::tuple<Args...> that would automatically generate at least some of the methods (such as comparison and serialization)?

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

There's an intriguing thing in Boost which seems to allow to treat a struct as a tuple, but the doc is obnoxious:
https://www.boost.org/doc/libs/1_76_0/libs/fusion/doc/html/fusion/adapted/adapt_struct.html

Another intriguing thing when looking for "C++ named tuple":
https://stackoverflow.com/a/17072195/10194

@lidavidm
Copy link
Member Author

Can we explain the objectives and design constraints here?

The goal is to just add basic utility methods on FunctionOptions. Then we could make them serializable/pickleable (and by extension, we could then support pickling/unpickling things like Expressions, and properly compare them too). The main constraint is not assuming that all the subclasses are statically known, so that we can move the function implementations into a separate library eventually with ARROW-8891.

Also, if the goal is to serialize function options as a struct scalar using the IPC format, it would be good to check performance, because I fear there may be a lot of overhead doing so.

I'll add a benchmark.

One more thought: we should try to minimize the hand-writing of boilerplate.

Yes…I'll look into this, this initial refactor was rather tedious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add operator== and PrintTo so these can just be written

Suggested change
ASSERT_TRUE(cur.Equals(cur)) << cur.ToString();
ASSERT_EQ(cur, cur);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think FunctionOptionsType itself needs to be internal/opaque. Instead we can have the public methods be Serialize/Deserialize and let To/FromStructScalar be the internal details

Comment on lines 50 to 51
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static std::unique_ptr<internal::FunctionOptionsType> instance(new CastOptionsType());
return instance.get();
static const CastOptionsType instance;
return &instance;

@bkietz
Copy link
Member

bkietz commented Jun 21, 2021

@lidavidm 676c902

@lidavidm
Copy link
Member Author

Thanks! I'll circle back to this soon

@lidavidm
Copy link
Member Author

Alright, I've pushed half a stab at this - I'm going to pick up ARROW-11932 so I can write a generic serialization for vector<T> (where T itself is serializable to Scalar). We'll also probably want some way to cut down on the boilerplate for enums - right now I have a set of partially specialized traits to handle enums, which are a little tedious to implement; a macro would probably help there.

@lidavidm lidavidm force-pushed the arrow-13025 branch 6 times, most recently from 4967267 to a635db7 Compare June 25, 2021 15:26
@lidavidm lidavidm marked this pull request as ready for review June 25, 2021 15:57
@lidavidm
Copy link
Member Author

Thanks. Some of this applies to #10579 so I'll address that there and rebase.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only small nits remain, this is looking great

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for doing this!


class ARROW_EXPORT CompareOptions : public FunctionOptions {
public:
explicit CompareOptions(CompareOperator op);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is necessary; that's unfortunate. Could you open a follow up JIRA to remove the binding to CompareOptions and demote this to a deprecated struct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed ARROW-13219


namespace internal {
template <>
struct EnumTraits<compute::SortOrder>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to put this here instead of api_vector.cc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it had to be here to be considered by the overloads of the Generic* functions for SortKey, which in turn I had to place in this header or they wouldn't be considered by the generated class in GetFunctionOptionsType. If I move all the overloads for SortKey/SortOrder into api_vector.cc, then I get errors like this:

In file included from /home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/api_vector.cc:28:
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h: In instantiation of 'arrow::Result<std::shared_ptr<arrow::Scalar> > arrow::compute::internal::GenericToScalar(const std::vector<T>&) [with T = arrow::compute::SortKey]':
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:462:34:   required from 'void arrow::compute::internal::ToStructScalarImpl<Options>::operator()(const Property&, size_t) [with Property = arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey> >; Options = arrow::compute::SortOptions; size_t = long unsigned int]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:67:28:   required from 'void arrow::internal::ForEachTupleMemberImpl(const std::tuple<_Elements ...>&, Fn&&, arrow::internal::index_sequence<I ...>) [with long unsigned int ...I = {0}; T = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}; Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:72:25:   required from 'void arrow::internal::ForEachTupleMember(const std::tuple<_Tps ...>&, Fn&&) [with T = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}; Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:100:23:   required from 'void arrow::internal::PropertyTuple<Properties>::ForEach(Fn&&) const [with Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:456:5:   required from 'arrow::compute::internal::ToStructScalarImpl<Options>::ToStructScalarImpl(const Options&, const Tuple&, std::vector<std::__cxx11::basic_string<char> >*, std::vector<std::shared_ptr<arrow::Scalar> >*) [with Tuple = arrow::internal::PropertyTuple<arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > > >; Options = arrow::compute::SortOptions]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:536:7:   required from 'arrow::Status arrow::compute::internal::GetFunctionOptionsType(const Properties& ...)::OptionsType::ToStructScalar(const arrow::compute::FunctionOptions&, std::vector<std::__cxx11::basic_string<char> >*, std::vector<std::shared_ptr<arrow::Scalar> >*) const [with Options = arrow::compute::SortOptions; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:550:3:   required from 'const arrow::compute::FunctionOptionsType* arrow::compute::internal::GetFunctionOptionsType(const Properties& ...) [with Options = arrow::compute::SortOptions; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/api_vector.cc:182:89:   required from here
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:264:59: error: no matching function for call to 'GenericTypeSingleton<arrow::compute::SortKey>()'
  264 |   std::shared_ptr<DataType> type = GenericTypeSingleton<T>();
      |                                    ~~~~~~~~~~~~~~~~~~~~~~~^~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense

@bkietz bkietz closed this in 1430c93 Jun 30, 2021
c_options = move(GetResultValue(move(maybe_options)))
type_name = frombytes(c_options.get().options_type().type_name())
mapping = {
"array_sort": ArraySortOptions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is their a reason the type_name is different from the C++ class name (which is the same as the Python class name)? If it was the same, we probably wouldn't need to hand-maintain this mapping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason, thanks for the idea. I filed ARROW-13235.

return options

def __repr__(self):
return frombytes(self.get_options().ToString())
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably still want to include the name of the options in the output (now it's only the arguments).

In [9]: pc.TrimOptions("a")
Out[9]: {characters="a"}

Opened https://issues.apache.org/jira/browse/ARROW-13236 for this for the Python side (or is that something we also want to add on the C++ side? (so the output of {{ToString}} includes it))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the output of the existing serialization for Expressions which didn't directly include the name. But maybe that's worth including in both C++ and Python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants