-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4810: [Format][C++] Add "LargeList" type with 64-bit offsets #3848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
format/Schema.fbs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be backwards incompatible, because the type is encoded with an enum which is just an integer (https://google.github.io/flatbuffers/md__schemas.html), you probably need to put this at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed!
format/Schema.fbs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document that this is type optional for implementations (unless we can get consensus on the mailing list if we want it to be globally supported)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I fixed it!
|
This is ready to review in case there is more feedback! |
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment on the jira asking for both c++ and Java implementations before adding a type to schema.fbs might want to discuss this on the mailing list (sorry if I missed the Java side)
|
Thanks for pointing me to that. Yeah, there is currently no Java implementation for this I'm afraid (and I don't know the Java codebase well enough to implement it). If anybody has time to do it that would be great! |
|
If someone else doesn't speak up I can put it on my to do list, behind getting my currently open PRs checked in (unless this is blocking something, in which case I can reprioritize) |
|
Thanks, that would be amazing! I'm not really blocked and will just squash this PR for now and keep working on top of it. Let me know if I can help you with your currently open PRs! |
de53f93 to
c885e90
Compare
|
Actually, if you want to trade #3644 is missing the C++ side of the implementation if you can pick that up, I can prioritize the work on the java side for this. For now I will just clone this PR and start working on top of it, but if there is a good way to collaborate (on a different PR a using branches was mentioned, I don't know the exact mechanics) let me know. |
|
I have funding if you know any Java contractors who might be interested in doing work-for-hire on this project to help with work beyond what the handful of us volunteers are able to do |
|
I started working together with an 'ex'-JVM contractor today actally, maybe we can help out here. |
|
@maartenbreddels @pcmoritz I'm going to pick up the java side of things (unless you've already started). My aim is to have a PR out sometime towards the end of the week. |
|
A proof of concept java version (see caveats in commit message): 03956ca @pcmoritz let me know if you want to try to write the integration tests or can I can try to do it later in the week. Also if someone can point me to java benchmarks (and any other tests that should be run), so I can try to assess regressions and upstream API breakages I would appreciate it. |
👍 |
|
@emkornfield Thanks, this is great! I got started on the C++ side of DurationInterval and plan to have something by the end of the week. Unfortunately I can't commit to more at the moment :( |
|
Actually if you want to share what you have I wouldn't mind doing it myself for durationinterval I want to get more familiar with the code. I think we should discuss use cases for LargeList on the ML since I think it is limited by other parts of the arrow spec but I might be misunderstanding something |
|
I haven't gotten too far yet: master...pcmoritz:interval-cpp Most of the code will be pretty similar to this PR (in terms of what needs to be added, etc) :) Feel free to continue/make your own branch! |
|
Thanks. Also I commented on the JIRA, but my confusion over utility was based on the memory layout document to indicates arrays should still only be 32-bit but Message.fbs was updated quite a while ago to indicate 64-bit offsets. |
|
|
||
| // List with 64-bit offsets. This type is optional and currently | ||
| // only supported by the C++ implementation. | ||
| table LargeList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsaintjacques had a suggestion which I liked and I think is worth discussing to parameterize List instead of creating a new type. I know @wesm suggested he prefers this as a separate type like you have here, but I'm not sure what the rationale is.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the principle this must be validated through the ML discussion. But here are some comments on the implementation.
| LIST, | ||
|
|
||
| /// A list of some logical data type, using 64-bit offsets | ||
| LARGE_LIST, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting this before the end of the enum will break the ABI. I'm not sure we care. @xhochy Comments ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We sadly break the ABI quite a lot, I've given up a bit on keeping it stable. I'll look into this in 3-6 months again.
|
|
||
| std::string ToString() const override; | ||
|
|
||
| std::string name() const override { return "large_list"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps "list64".
| /// | ||
| /// List data is nested data where each value is a variable number of | ||
| /// child items. Lists can be recursively nested, for example | ||
| /// list(list(int32)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"large_list(large_list(int32))".
| Status Visit(const BinaryArray& array) override { return VisitBinary(array); } | ||
|
|
||
| Status Visit(const ListArray& array) override { | ||
| template <typename ListArrayType, typename OffsetType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the offset type should be available as ListType::offset_type and LargeListType::offset_type.
| template <typename T> | ||
| inline typename std::enable_if<std::is_base_of<ListArray, T>::value, Status>::type | ||
| inline typename std::enable_if<std::is_base_of<ListArray, T>::value || | ||
| std::is_base_of<LargeListArray, T>::value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a BaseListArray base class would be useful (and perhaps also a BaseListType).
Also enable_if_list<T> in type_traits.h may deserve updating.
| ASSERT_EQ("counts", type.value_field()->name()); | ||
| } | ||
|
|
||
| TEST_F(TestLargeListArray, TestBuilderPreserveFieleName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to copy/paste all ListArray tests instead of using some kind of parametrization?
| /// represent multiple different logical types. If no logical type is provided | ||
| /// at construction time, the class defaults to List<T> where t is taken from the | ||
| /// value_builder/values that the object is constructed with. | ||
| class ARROW_EXPORT LargeListBuilder : public ArrayBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of copy/pasting the ListBuilder implementation, how about factoring it into a templated base class that both concrete implementations can inherit from?
|
@pcmoritz were you going to take this up again? |
ed180da to
85fe336
Compare
|
Recommending to close this in favour of #4969 |
|
Closing in favor of #4969 which is merged |
No description provided.