ARROW-3762: [C++/Python] Support reading Parquet BYTE_ARRAY columns containing over 2GB of data #3171
wesm wants to merge 12 commits into apache:master
|
@kszucs we aren't running the "large_memory" unit tests in Travis CI. What do you think about having a Docker target where we can run these so they can at least be spot-checked periodically? |
|
I'm all done here, just will make sure the build is passing |
I don't think it's useful to inline those if AppendNextOffset is not inlined.
Good point. I'll inline AppendNextOffset then
|
This PR seems basically fine to me. I posted a few minor comments. |
…s. add failing test case for ARROW-3762. Add ChunkedBinaryBuilder, make BinaryBuilder Append methods inline
Change-Id: I0eced60a1f8e16096a1b441b622ba750d1d59ca6
…ction of arrow::compute::Datum Change-Id: I483059a545c69a9b25d543faad641785da6bea29
…row test suite passing Change-Id: Icb260f6ffc4f41ee7519653bf8d3f48c2da30091
Change-Id: I35ab3ace0e4ca7a80fc7d85e55ac55ea222b15dc
Change-Id: I8f0a35ae4e8581790f7731ee2ed023a54caf0f31
Change-Id: I7fac456a34aa81683fa7315ae1b287be7f0d16e0
Change-Id: I47f93c7d8561b83414ab34f709fec66a6eb462d2
Change-Id: I8266354f04c8e14819fe4c72d28474e09843c13c
…tOffset Change-Id: Ibfc09617b365c937e7af6a4943c274843f6e7a33
|
The inlining of BinaryBuilder methods produces a meaningful benchmark improvement (before / after) |
|
Nice :-) |
Change-Id: I48147645784402e7cf004a82151d66f337d1664e
|
+1 |
|
@wesm I created an issue for running the large memory tests: ARROW-4046 |
|
thanks =) |
|
I still see this error when using 0.13.0, also tested with 0.12.0. The code I've tested this with is the exact same code as in ARROW-4046: |
|
You mean a different JIRA than https://issues.apache.org/jira/browse/ARROW-4046, right? Can you post this on the appropriate JIRA issue or create a new one so we can track this? Thanks |
|
Right, my mistake. I meant this one: https://issues.apache.org/jira/browse/ARROW-3762 |
|
Hi @wesm, big fan of your work! |
|
@yogeshg can you open a JIRA issue either in Arrow or Spark? I think that this is something that will have to be handled on the Spark side cc @BryanCutler |
|
I'll do that soon.
|
@yogeshg, this might be the related issue from the Java MR Parquet Reader that Spark uses: https://issues.apache.org/jira/browse/PARQUET-980. Please open another JIRA if it is not.
This patch ended up being a bit more of a bloodbath than I planned: please accept my apologies.
Associated changes in this patch:
As far as what code to review, focus efforts on
I'm going to tackle ARROW-2970, which should not be complicated after this patch; I will submit that as a PR after this is reviewed and merged.
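For anyone skimming the thread, here is a rough sketch of the chunking idea behind the ChunkedBinaryBuilder mentioned in the commits above. A binary array with signed 32-bit offsets can hold at most 2^31 - 1 bytes of data, so a column with more than that has to be split into several chunks. This is only an illustration in plain Python: the class name `ChunkedBinaryBuilderSketch`, its methods, and the `max_chunk_bytes` parameter are hypothetical and are not the actual Arrow C++ or pyarrow API.

```python
from typing import List


class ChunkedBinaryBuilderSketch:
    """Hypothetical illustration only; not Arrow's ChunkedBinaryBuilder."""

    def __init__(self, max_chunk_bytes: int = 2**31 - 1) -> None:
        # Assumption: each chunk may hold at most this many bytes of value
        # data, mirroring the limit imposed by signed 32-bit offsets.
        self.max_chunk_bytes = max_chunk_bytes
        self._chunks: List[List[bytes]] = []
        self._current: List[bytes] = []
        self._current_bytes = 0

    def append(self, value: bytes) -> None:
        if len(value) > self.max_chunk_bytes:
            raise ValueError("a single value exceeds the chunk capacity")
        # Start a new chunk when this value would overflow the current one.
        if self._current_bytes + len(value) > self.max_chunk_bytes:
            self._flush()
        self._current.append(value)
        self._current_bytes += len(value)

    def _flush(self) -> None:
        # Close the current chunk and begin an empty one.
        self._chunks.append(self._current)
        self._current = []
        self._current_bytes = 0

    def finish(self) -> List[List[bytes]]:
        # Return all chunks, including a partially filled last one,
        # and reset the builder.
        if self._current:
            self._flush()
        chunks, self._chunks = self._chunks, []
        return chunks


# Usage with a tiny capacity so the chunk boundary is easy to see.
builder = ChunkedBinaryBuilderSketch(max_chunk_bytes=10)
for item in (b"hello", b"world", b"!"):
    builder.append(item)
chunks = builder.finish()
```

With the 10-byte capacity used here, the first two values fill one chunk exactly and the third value starts a second chunk. The reader side then hands back a chunked result rather than a single contiguous array.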