-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-2119: [IntegrationTest] Add test case with a stream having no record batches #3871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
See integration test results: https://gist.github.com/wesm/df111020b498f3e7b261b8667aced4e2 It looks like only C++ can consume its own input. The other 8 entries of the matrix fail |
|
Hm, well this is a bit concerning. Failure in the integration tests does not fail the build |
|
Rebase and try again |
|
@wesm I can take up the Java side of things in the 0.14 (sorry a bit swamped at the moment) release time frame. Does it pay to discuss this on the ML to be sure there is consensus? (Apologies if I missed the thread). |
|
yeah I think it would make sense to ensure there is agreement about what should occur with a stream with no batches |
|
@wesm were you going to follow up on the ML about this? Want me to? |
|
@emkornfield I guess I dropped the ball. Do you want to ping the list about it? Then we can try to fix the Java and C++ implementations at least |
|
I rebased and set the integration test to only run for C++ |
|
+1 |
Re: #3871, [ARROW-2119](https://issues.apache.org/jira/browse/ARROW-2119), and closes [ARROW-5396](https://issues.apache.org/jira/browse/ARROW-5396). This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold: 1. If the Readers' source message stream terminates after reading the Schema message, the Reader will yield a dummy zero-length RecordBatch with the schema. 2. The Writer always writes the schema for any RecordBatch, but skips writing the RecordBatch field metadata if it's empty. This is necessary because the reader and writer don't know about each other when they're communicating via the Node and DOM stream i/o primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and Writer don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that contains the schema instead. Author: ptaylor <paul.e.taylor@me.com> Author: Wes McKinney <wesm+git@apache.org> Closes #4373 from trxcllnt/js/fix-no-record-batches and squashes the following commits: c860696 <Wes McKinney> Run no-batches integration test for JS also 86d192d <ptaylor> define an _InternalEmptyRecordBatch class to signal that the reader source stream has no RecordBatches 193b08d <ptaylor> ensure reader and writer support the case where a stream or file has a schema but no recordbatches
I think it is not a bad idea to not fail on reading and writing streams that have no record batches, rather than raising an error.
This would require code changes in Java and JS at least, so I will need help from others if this is thought to be a good idea. Might be good to add some unit tests around this also