Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Jul 22, 2019

These are like Binary and String respectively, except with 64-bit offsets
so as to allow extremely large individual values.

@emkornfield
Copy link
Contributor

@pitrou were you planning on adding the java implementation as well? Or were you looking for someone to collaborate on that?

@pitrou
Copy link
Member Author

pitrou commented Jul 23, 2019

@emkornfield I need someone else to do the Java implementation. Also I think that can be in a separate PR.

@pitrou
Copy link
Member Author

pitrou commented Jul 23, 2019

Note this was more or less reviewed in #4919, but the review comments seem to have been lost in the post-release merge hiccups.

@emkornfield would you like to do a final review of this? I'd like to move on rather quickly if possible :-)

@pitrou pitrou force-pushed the ARROW-750-large-binary-2 branch from e6418b0 to daa3802 Compare July 23, 2019 10:21
@emkornfield
Copy link
Contributor

I can review tonight if no one gets to it sooner. I can also help with the Java implementation of this if there are no other volunteers, but probably won't be able to that until next week (I had a poc commit for large lists). If this is really high priority I can try to rearrange some things to do more work this week.

The last email thread on new types seemed to indicate we want the two implementation checked in at the same time.

@pitrou
Copy link
Member Author

pitrou commented Jul 23, 2019

I don't know. I thought Java was int64-challenged? @wesm

@emkornfield
Copy link
Contributor

emkornfield commented Jul 23, 2019 via email

@codecov-io
Copy link

codecov-io commented Jul 23, 2019

Codecov Report

Merging #4921 into master will decrease coverage by 13.24%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4921       +/-   ##
===========================================
- Coverage   88.51%   75.27%   -13.25%     
===========================================
  Files         908       57      -851     
  Lines      115478     3713   -111765     
  Branches     1418        0     -1418     
===========================================
- Hits       102218     2795    -99423     
+ Misses      12898      918    -11980     
+ Partials      362        0      -362
Impacted Files Coverage Δ
r/src/csv.cpp 100% <0%> (ø) ⬆️
python/pyarrow/ipc.pxi
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
cpp/src/plasma/client.cc
cpp/src/arrow/io/test-common.h
cpp/src/arrow/util/int-util-test.cc
cpp/src/arrow/python/io.cc
... and 809 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff8c63b...bd59891. Read the comment docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be inside the valid_bytes value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean. Can you elaborate?
Also note this is mostly moving method definitions around, there is little to no new code in this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the if block above on line 135?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that be incorrect? You need to build the null bitmap somehow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this change, but I am kind of curious why we made zero out in buffer builder always versus only zero-ing for the bitmap (which I think is the only thing that requires it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know... Ideally zeroing the padding should be cheap operation, but perhaps not if you resize often.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to look I think when we resize we pad everything from size to capacity. If capacity doubles size each time, it seems like it is effectively an O(N) cost?

anyways for a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's possible. Can you open a JIRA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use TypedBufferBuilder instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see the point. It's a preallocated temporary container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main rationale, I had for this it is theoretically "data" and not metadata. For data related allocations I think we should be consistently be trying to use memory backed structures so users get a consistent OOM experience. Not necessary for this PR>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to push this method to ArrayType?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to find a good, non-confusing name for it... Any idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalculateOffsetBufferLength?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offsets buffer could be larger, e.g. if the array is sliced.

@pitrou pitrou force-pushed the ARROW-750-large-binary-2 branch from daa3802 to bd59891 Compare July 24, 2019 10:19
@pitrou
Copy link
Member Author

pitrou commented Jul 25, 2019

@emkornfield You may want to review again.

@emkornfield
Copy link
Contributor

emkornfield commented Jul 25, 2019 via email

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point my comments are mostly nits or things to follow-up.

The one remaining question on if we should check this in or wait for the Java implementation (like I said, I think I should be able to have something next week)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the if block above on line 135?

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to request changes just so we don't hastily merge, but per my previous comment, I think the main question now is resolving the Java side.

@pitrou
Copy link
Member Author

pitrou commented Jul 26, 2019

I'll take a look at most comments on Monday. As for the Java side, though, I'd rather not wait for it as #4927 also pends on this PR.

@emkornfield
Copy link
Contributor

As for the Java side, though, I'd rather not wait for it as #4927 also pends on this PR.

Maybe be worth sending something to the ML?

@pitrou
Copy link
Member Author

pitrou commented Jul 26, 2019

Maybe be worth sending something to the ML?

Is that necessary? Since you're planning to do the Java soon, it won't be long until there are two independent implementations of this feature.

@wesm Thoughts about the process here?

@wesm
Copy link
Member

wesm commented Jul 26, 2019

I'll write an e-mail to the ML about it now. We definitely should not release again without the other implementation landing

pitrou added 2 commits July 29, 2019 11:25
These are like Binary and String respectively, except with 64-bit offsets
so as to allow extremely large individual values.
@pitrou pitrou force-pushed the ARROW-750-large-binary-2 branch from bd59891 to 3e17dcd Compare July 29, 2019 09:44
@pitrou
Copy link
Member Author

pitrou commented Jul 29, 2019

I rebased and addressed most of the review comments. Also (re-)added some tests.

@pitrou
Copy link
Member Author

pitrou commented Jul 29, 2019

@emkornfield
Copy link
Contributor

+1, LGTM. (Feel free to merge, otherwise I'll do so in the morning to give a little bit more time to see if anyone responds to @wesm's e-mail).

@emkornfield
Copy link
Contributor

I might be more randomized this week than I expected but still think a Java implementation should be doable.

@pitrou pitrou closed this in 6e8607f Jul 30, 2019
@pitrou pitrou deleted the ARROW-750-large-binary-2 branch July 30, 2019 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants