-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-750: [Format] [C++] Add LargeBinary and LargeString types #4921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b32e7c9 to
e6418b0
Compare
|
@pitrou were you planning on adding the java implementation as well? Or were you looking for someone to collaborate on that? |
|
@emkornfield I need someone else to do the Java implementation. Also I think that can be in a separate PR. |
|
Note this was more or less reviewed in #4919, but the review comments seem to have been lost in the post-release merge hiccups. @emkornfield would you like to do a final review of this? I'd like to move on rather quickly if possible :-) |
e6418b0 to
daa3802
Compare
|
I can review tonight if no one gets to it sooner. I can also help with the Java implementation of this if there are no other volunteers, but probably won't be able to that until next week (I had a poc commit for large lists). If this is really high priority I can try to rearrange some things to do more work this week. The last email thread on new types seemed to indicate we want the two implementation checked in at the same time. |
|
I don't know. I thought Java was int64-challenged? @wesm |
|
It is for address spaces, but that was something we discussed changing (it
would still be a little challenging but at least the APIs could lie about
it)
…On Tuesday, July 23, 2019, Antoine Pitrou ***@***.***> wrote:
I don't know. I thought Java was int64-challenged? @wesm
<https://github.com/wesm>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4921?email_source=notifications&email_token=AEIKYDVFFNW7BXG2HMCQGNLQA4NG3A5CNFSM4IF35572YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2TN2GI#issuecomment-514252057>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEIKYDW475PRUFF4B3RBLN3QA4NG3ANCNFSM4IF3557Q>
.
|
Codecov Report
@@ Coverage Diff @@
## master #4921 +/- ##
===========================================
- Coverage 88.51% 75.27% -13.25%
===========================================
Files 908 57 -851
Lines 115478 3713 -111765
Branches 1418 0 -1418
===========================================
- Hits 102218 2795 -99423
+ Misses 12898 918 -11980
+ Partials 362 0 -362Continue to review full report at Codecov.
|
cpp/src/arrow/array/builder_binary.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be inside the valid_bytes value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean. Can you elaborate?
Also note this is mostly moving method definitions around, there is little to no new code in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the if block above on line 135?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that be incorrect? You need to build the null bitmap somehow.
cpp/src/arrow/array/builder_binary.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for this change, but I am kind of curious why we made zero out in buffer builder always versus only zero-ing for the bitmap (which I think is the only thing that requires it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know... Ideally zeroing the padding should be cheap operation, but perhaps not if you resize often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to look I think when we resize we pad everything from size to capacity. If capacity doubles size each time, it seems like it is effectively an O(N) cost?
anyways for a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's possible. Can you open a JIRA?
cpp/src/arrow/array/builder_binary.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use TypedBufferBuilder instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see the point. It's a preallocated temporary container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main rationale, I had for this it is theoretically "data" and not metadata. For data related allocations I think we should be consistently be trying to use memory backed structures so users get a consistent OOM experience. Not necessary for this PR>
cpp/src/arrow/ipc/feather.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to push this method to ArrayType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fail to find a good, non-confusing name for it... Any idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CalculateOffsetBufferLength?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offsets buffer could be larger, e.g. if the array is sliced.
daa3802 to
bd59891
Compare
|
@emkornfield You may want to review again. |
|
Yes, sorry will look tonight.
…On Thursday, July 25, 2019, Antoine Pitrou ***@***.***> wrote:
@emkornfield <https://github.com/emkornfield> You may want to review
again.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4921?email_source=notifications&email_token=AEIKYDVWCROOAT5H3L6MP23QBFUSTA5CNFSM4IF35572YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2Y3T2Q#issuecomment-514963946>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEIKYDQXPJMNYJZ7PHF77TTQBFUSTANCNFSM4IF3557Q>
.
|
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this point my comments are mostly nits or things to follow-up.
The one remaining question on if we should check this in or wait for the Java implementation (like I said, I think I should be able to have something next week)
cpp/src/arrow/array/builder_binary.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the if block above on line 135?
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to request changes just so we don't hastily merge, but per my previous comment, I think the main question now is resolving the Java side.
|
I'll take a look at most comments on Monday. As for the Java side, though, I'd rather not wait for it as #4927 also pends on this PR. |
Maybe be worth sending something to the ML? |
Is that necessary? Since you're planning to do the Java soon, it won't be long until there are two independent implementations of this feature. @wesm Thoughts about the process here? |
|
I'll write an e-mail to the ML about it now. We definitely should not release again without the other implementation landing |
These are like Binary and String respectively, except with 64-bit offsets so as to allow extremely large individual values.
bd59891 to
3e17dcd
Compare
|
I rebased and addressed most of the review comments. Also (re-)added some tests. |
|
+1, LGTM. (Feel free to merge, otherwise I'll do so in the morning to give a little bit more time to see if anyone responds to @wesm's e-mail). |
|
I might be more randomized this week than I expected but still think a Java implementation should be doable. |
These are like Binary and String respectively, except with 64-bit offsets
so as to allow extremely large individual values.