-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13025: [C++][Python] Add FunctionOptions::Equals/ToString/Serialize #10511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bkietz I'd appreciate if you had any feedback on the approach here/if this was in line with what you were thinking. |
|
@lidavidm thanks for working on this!
I agree that it's unfortunately necessary in this case:
We could add this to class FunctionOptionsType {
public:
virtual std::string name() const = 0;
virtual bool Compare(const FunctionOptions&, const FunctionOptions&) const = 0;
virtual std::string Stringify(const FunctionOptions&) const = 0;
virtual Result<std::unique_ptr<Buffer>> Serialize(const FunctionOptions&) const = 0;
virtual Result<std::unique_ptr<FunctionOptions>> Deserialize(const Buffer&) const = 0;
};
class FunctionOptions {
public:
virtual ~FunctionOptions() = default;
const FunctionOptionsType* options_type() const { return type_; }
bool Equals(const FunctionOptions& other) const {
return type_ = other.type_ && type_->Compare(*this, other);
}
// forward other methods to type_ as well
protected:
explicit FunctionOptions(const FunctionOptionsType* type) : type_(type) {}
const FunctionOptionsType* type_;
};
class FunctionRegistry {
public:
// Called alongside AddFunction
// will raise if another options type with the same name exists
Status AddFunctionOptionsType(const FunctionOptionsType*);
};
class Function {
public:
// nullptr if the function does not take options.
// This is also nice since we can assert correctly typed FunctionOptions
const FunctionOptionsType* options_type() const;
};
Yes, I'd prefer that implementation detail be kept internal. Please ensure ToScalar/FromScalar can't be accessed without including an _internal header. |
|
Can we explain the objectives and design constraints here? |
|
Also, if the goal is to serialize function options as a struct scalar using the IPC format, it would be good to check performance, because I fear there may be a lot of overhead doing so. |
|
One more thought: we should try to minimize the hand-writing of boilerplate. Is it possible to somehow define a subclass of |
|
There's an intriguing thing in Boost which seems to allow to treat a Another intriguing thing when looking for "C++ named tuple": |
The goal is to just add basic utility methods on FunctionOptions. Then we could make them serializable/pickleable (and by extension, we could then support pickling/unpickling things like Expressions, and properly compare them too). The main constraint is not assuming that all the subclasses are statically known, so that we can move the function implementations into a separate library eventually with ARROW-8891.
I'll add a benchmark.
Yes…I'll look into this, this initial refactor was rather tedious. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add operator== and PrintTo so these can just be written
| ASSERT_TRUE(cur.Equals(cur)) << cur.ToString(); | |
| ASSERT_EQ(cur, cur); |
cpp/src/arrow/compute/registry.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think FunctionOptionsType itself needs to be internal/opaque. Instead we can have the public methods be Serialize/Deserialize and let To/FromStructScalar be the internal details
cpp/src/arrow/compute/cast.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static std::unique_ptr<internal::FunctionOptionsType> instance(new CastOptionsType()); | |
| return instance.get(); | |
| static const CastOptionsType instance; | |
| return &instance; |
|
Thanks! I'll circle back to this soon |
|
Alright, I've pushed half a stab at this - I'm going to pick up ARROW-11932 so I can write a generic serialization for |
4967267 to
a635db7
Compare
|
Thanks. Some of this applies to #10579 so I'll address that there and rebase. |
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only small nits remain, this is looking great
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for doing this!
|
|
||
| class ARROW_EXPORT CompareOptions : public FunctionOptions { | ||
| public: | ||
| explicit CompareOptions(CompareOperator op); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is necessary; that's unfortunate. Could you open a follow up JIRA to remove the binding to CompareOptions and demote this to a deprecated struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed ARROW-13219
|
|
||
| namespace internal { | ||
| template <> | ||
| struct EnumTraits<compute::SortOrder> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to put this here instead of api_vector.cc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it had to be here to be considered by the overloads of the Generic* functions for SortKey, which in turn I had to place in this header or they wouldn't be considered by the generated class in GetFunctionOptionsType. If I move all the overloads for SortKey/SortOrder into api_vector.cc, then I get errors like this:
In file included from /home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/api_vector.cc:28:
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h: In instantiation of 'arrow::Result<std::shared_ptr<arrow::Scalar> > arrow::compute::internal::GenericToScalar(const std::vector<T>&) [with T = arrow::compute::SortKey]':
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:462:34: required from 'void arrow::compute::internal::ToStructScalarImpl<Options>::operator()(const Property&, size_t) [with Property = arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey> >; Options = arrow::compute::SortOptions; size_t = long unsigned int]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:67:28: required from 'void arrow::internal::ForEachTupleMemberImpl(const std::tuple<_Elements ...>&, Fn&&, arrow::internal::index_sequence<I ...>) [with long unsigned int ...I = {0}; T = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}; Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:72:25: required from 'void arrow::internal::ForEachTupleMember(const std::tuple<_Tps ...>&, Fn&&) [with T = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}; Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/util/reflection_internal.h:100:23: required from 'void arrow::internal::PropertyTuple<Properties>::ForEach(Fn&&) const [with Fn = arrow::compute::internal::ToStructScalarImpl<arrow::compute::SortOptions>&; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:456:5: required from 'arrow::compute::internal::ToStructScalarImpl<Options>::ToStructScalarImpl(const Options&, const Tuple&, std::vector<std::__cxx11::basic_string<char> >*, std::vector<std::shared_ptr<arrow::Scalar> >*) [with Tuple = arrow::internal::PropertyTuple<arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > > >; Options = arrow::compute::SortOptions]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:536:7: required from 'arrow::Status arrow::compute::internal::GetFunctionOptionsType(const Properties& ...)::OptionsType::ToStructScalar(const arrow::compute::FunctionOptions&, std::vector<std::__cxx11::basic_string<char> >*, std::vector<std::shared_ptr<arrow::Scalar> >*) const [with Options = arrow::compute::SortOptions; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:550:3: required from 'const arrow::compute::FunctionOptionsType* arrow::compute::internal::GetFunctionOptionsType(const Properties& ...) [with Options = arrow::compute::SortOptions; Properties = {arrow::internal::DataMemberProperty<arrow::compute::SortOptions, std::vector<arrow::compute::SortKey, std::allocator<arrow::compute::SortKey> > >}]'
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/api_vector.cc:182:89: required from here
/home/lidavidm/Code/upstream/arrow-13025/cpp/src/arrow/compute/function_internal.h:264:59: error: no matching function for call to 'GenericTypeSingleton<arrow::compute::SortKey>()'
264 | std::shared_ptr<DataType> type = GenericTypeSingleton<T>();
| ~~~~~~~~~~~~~~~~~~~~~~~^~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes sense
| c_options = move(GetResultValue(move(maybe_options))) | ||
| type_name = frombytes(c_options.get().options_type().type_name()) | ||
| mapping = { | ||
| "array_sort": ArraySortOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is their a reason the type_name is different from the C++ class name (which is the same as the Python class name)? If it was the same, we probably wouldn't need to hand-maintain this mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No good reason, thanks for the idea. I filed ARROW-13235.
| return options | ||
|
|
||
| def __repr__(self): | ||
| return frombytes(self.get_options().ToString()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably still want to include the name of the options in the output (now it's only the arguments).
In [9]: pc.TrimOptions("a")
Out[9]: {characters="a"}
Opened https://issues.apache.org/jira/browse/ARROW-13236 for this for the Python side (or is that something we also want to add on the C++ side? (so the output of {{ToString}} includes it))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the output of the existing serialization for Expressions which didn't directly include the name. But maybe that's worth including in both C++ and Python.
This is a draft of adding more utility methods to FunctionOptions. It's not fully implemented (it needs rebasing + serialization isn't implemented for most options, plus there are various TODOs scattered). But before I proceed further, I wanted to get some feedback.
Some concerns I have: