Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Mar 12, 2019

I think it is not a bad idea to not fail on reading and writing streams that have no record batches, rather than raising an error.

This would require code changes in Java and JS at least, so I will need help from others if this is thought to be a good idea. Might be good to add some unit tests around this also

@wesm
Copy link
Member Author

wesm commented Mar 12, 2019

See integration test results:

https://gist.github.com/wesm/df111020b498f3e7b261b8667aced4e2

It looks like only C++ can consume its own input. The other 8 entries of the matrix fail

@wesm
Copy link
Member Author

wesm commented Mar 13, 2019

Hm, well this is a bit concerning. Failure in the integration tests does not fail the build

@fsaintjacques
Copy link
Contributor

Rebase and try again

@emkornfield
Copy link
Contributor

@wesm I can take up the Java side of things in the 0.14 (sorry a bit swamped at the moment) release time frame. Does it pay to discuss this on the ML to be sure there is consensus? (Apologies if I missed the thread).

@wesm
Copy link
Member Author

wesm commented Mar 14, 2019

yeah I think it would make sense to ensure there is agreement about what should occur with a stream with no batches

@emkornfield
Copy link
Contributor

@wesm were you going to follow up on the ML about this? Want me to?

@wesm
Copy link
Member Author

wesm commented May 20, 2019

@emkornfield I guess I dropped the ball. Do you want to ping the list about it? Then we can try to fix the Java and C++ implementations at least

@wesm wesm changed the title WIP ARROW-2119: [IntegrationTest] Add test case with a stream having no record batches ARROW-2119: [IntegrationTest] Add test case with a stream having no record batches May 23, 2019
@wesm
Copy link
Member Author

wesm commented May 23, 2019

I rebased and set the integration test to only run for C++

@wesm
Copy link
Member Author

wesm commented May 23, 2019

+1

@wesm wesm closed this in b3a4e95 May 23, 2019
@wesm wesm deleted the ARROW-2119 branch May 23, 2019 15:46
wesm pushed a commit that referenced this pull request May 31, 2019
Re: #3871, [ARROW-2119](https://issues.apache.org/jira/browse/ARROW-2119), and closes [ARROW-5396](https://issues.apache.org/jira/browse/ARROW-5396).

This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold:

1. If the Readers' source message stream terminates after reading the Schema message, the Reader will yield a dummy zero-length RecordBatch with the schema.
2. The Writer always writes the schema for any RecordBatch, but skips writing the RecordBatch field metadata if it's empty.

This is necessary because the reader and writer don't know about each other when they're communicating via the Node and DOM stream i/o primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and Writer don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that contains the schema instead.

Author: ptaylor <paul.e.taylor@me.com>
Author: Wes McKinney <wesm+git@apache.org>

Closes #4373 from trxcllnt/js/fix-no-record-batches and squashes the following commits:

c860696 <Wes McKinney> Run no-batches integration test for JS also
86d192d <ptaylor> define an _InternalEmptyRecordBatch class to signal that the reader source stream has no RecordBatches
193b08d <ptaylor> ensure reader and writer support the case where a stream or file has a schema but no recordbatches
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants