# ARROW-5396: [JS] Support files and streams with no record batches #4373
## Conversation
@trxcllnt so the dummy RecordBatch is needed only to obtain the schema? Curious whether this causes a zero-length RecordBatch to be subsequently sent. With this change, does this pass the integration test I added?
@wesm yes, it passed the integration test when I ran it locally. The change in this PR causes the RecordBatchStreamReader to yield a dummy zero-length RecordBatch if the stream is going to terminate without yielding at least one RecordBatch (here), but it also updates the Writer to ignore zero-length RecordBatches (here). This will cause the Writer not to write any zero-length RecordBatches, but I can't think of a case where that'd be a problem. If it is, we could instead use an internal flag to indicate whether a zero-length RecordBatch should or shouldn't be written :-).
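For illustration, here is a minimal sketch of the round-trip this change enables, assuming the apache-arrow JS API of this era (`Table.empty()`, `RecordBatchStreamWriter.writeAll()`, and `Table.from()` are assumed to behave as in the 0.14-line releases):

```typescript
import { Table, RecordBatchStreamWriter } from 'apache-arrow';

(async () => {
  // a table with a schema but zero record batches
  const empty = Table.empty();

  // per this PR, the reader side yields a zero-length placeholder batch so the
  // schema survives the pipeline, but the writer skips serializing that batch
  const bytes = await RecordBatchStreamWriter.writeAll(empty).toUint8Array();

  // deserializing yields a table with the same schema and no rows
  const roundTripped = await Table.from(bytes);
  console.log(roundTripped.length); // > 0
})();
```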
Interesting. We also have an integration test for length-0 RecordBatch objects, so if this passes both tests then I suppose it's OK. I guess I should merge my no-record-batches patch with only C++ using it, and then you can rebase this and enable JS so we can get a complete build?
Ah, I forgot about that test. I thought you were asking about the no-record-batches integration test from PR #3871, which is what I tested locally. I don't think it'll pass the length-0 RecordBatch test as-is. I'll make that change here shortly and update this PR.
@wesm ok, I updated the Reader and Writer to use a new `_InternalEmptyPlaceholderRecordBatch` class internally.
…a schema but no recordbatches
…ource stream has no RecordBatches
622c875 to c860696
Rebased and enabled the no-batches integration test for JS.
Codecov Report
```diff
@@             Coverage Diff             @@
##           master    #4373      +/-   ##
==========================================
+ Coverage   88.34%   90.33%     +1.99%
==========================================
  Files         781       74       -707
  Lines       98721     5464     -93257
  Branches     1251     1253         +2
==========================================
- Hits        87212     4936     -82276
+ Misses      11273      520     -10753
+ Partials      236        8       -228
```

Continue to review full report at Codecov.
@TheNeuralBit do you mind reviewing the JS?
Would you mind adding a docstring on `_InternalEmptyPlaceholderRecordBatch`?
@trxcllnt if you can add that comment, I can go ahead and merge this.
I'm going ahead and merging this. Please feel free to submit a follow-up PR to document this.
---

This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

#### Builder API

The new `Builder` class exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. The `Builder` class also supports specifying a list of null-value sentinels: values that will be interpreted as "null" and recorded in the null bitmap instead of being written as valid elements.

Similar to the existing `Vector` API, `Builder` has a static `Builder.new()` method that returns the correct `Builder` subclass instance for the supplied DataType. Since the `Builder` constructor takes an options Object, this method also takes an Object:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
  type: new Utf8(),
  nullValues: [null, 'n/a']
});

utf8Builder
  .append('hello')
  .append('n/a') // will be interpreted to mean `null`
  .append('world')
  .append(null);

const utf8Vector = utf8Builder.finish().toVector();
console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]
```

The `Builder` class has two methods for flushing the pending values to their underlying ArrayBuffer representations: `flush(): Data<T>` and `toVector(): Vector<T>` (`toVector()` calls `flush()` and creates a `Vector` instance from the returned Data instance).

Calling `Builder.prototype.finish()` will finalize a `Builder` instance. After this, no more values should be written to the Builder instance. This is a no-op for most types, except the `DictionaryBuilder`, which flushes its internal dictionary and writes the values to the `Dictionary` type's `dictionaryVector` field.

#### Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

```typescript
Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)
```

#### Iterables and AsyncIterables

The static `throughIterable` and `throughAsyncIterable` methods take an `options` argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

```typescript
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));
```

The `options` argument can also specify a `queueingStrategy` and `highWaterMark` that control the chunking semantics (a sketch follows this list):

* If the `queueingStrategy` is `"count"` (or is omitted), the returned generator function will flush the `Builder` and yield a chunk once the number of values written to the Builder reaches the value supplied for `highWaterMark`, regardless of how much memory the `Builder` has allocated.
* If the `queueingStrategy` is `"bytes"`, the returned generator function will flush the `Builder` and yield a new chunk once the Builder's `byteLength` field reaches or exceeds the value supplied for `highWaterMark`, regardless of how many elements the `Builder` contains.
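For instance, here is a sketch of the count-based strategy using only the options described above (the `Int32` type and the chunk math are illustrative):

```typescript
import { Builder, Int32 } from 'apache-arrow';

const buildInt32 = Builder.throughIterable({
  type: new Int32(),
  nullValues: [null],
  queueingStrategy: 'count',
  highWaterMark: 100, // flush and yield a chunk per 100 appended values
});

function* numbers() { for (let i = 0; i < 250; ++i) { yield i; } }

// 250 values at 100 per chunk -> chunks of 100, 100, and 50
const chunks = [...buildInt32(numbers())];
console.log(chunks.map((chunk) => chunk.length)); // > [ 100, 100, 50 ]
```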
#### Node and DOM Streams

In addition to the Iterable transform APIs, we can also create node and DOM transform streams with similar options:

```typescript
import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = {
  type: new Utf8(),
  nullValues: [undefined, 'n/a'],
  queueingStrategy: 'bytes',
  highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
  Readable
    .from(['hello', 'n/a', 'world', undefined])
    .pipe(Builder.throughNode(options))
));
```
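`Builder.throughDOM` isn't shown above; here is a sketch of how the DOM-stream variant would be wired up, assuming it returns a `{ writable, readable }` pair compatible with `pipeThrough` (as this library's other `throughDOM` APIs do) and a browser-like environment with WHATWG streams:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

// a WHATWG ReadableStream of raw values...
const values = new ReadableStream<string | null>({
  start(controller) {
    ['hello', 'n/a', 'world', null].forEach((value) => controller.enqueue(value));
    controller.close();
  },
});

// ...piped through a Builder transform that emits Vector chunks
const chunks = values.pipeThrough(Builder.throughDOM({
  type: new Utf8(),
  nullValues: [null, 'n/a'],
}));
```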
#### Miscellaneous

* Updates most dependencies, updates TypeScript to v3.5.1 (and resolves #4452)
* Updates the BigInt compatibility type to use `Object.setPrototypeOf()`, yielding a 4x speedup
* Updates Int64 and Uint64 set routines to accept native `bigint` types if available
* Adds a docstring to the `_InternalEmptyPlaceholderRecordBatch` class added in #4373

Author: ptaylor <paul.e.taylor@me.com>

Closes #4476 from trxcllnt/js/data-builders and squashes the following commits:

7998d2a <ptaylor> add createIsValidFunction example docstring
07fa443 <ptaylor> remove default dictionary hash function
8b0752f <ptaylor> fix possible AsyncRandomAccessFile race condition retrieving filehandle size
4d2a4f0 <ptaylor> regenerate flatbuffer source from current format schemas
05acad8 <ptaylor> fix minor row serialization issues to be compatible with console.table()
c0f7f7b <ptaylor> fix a few minor formatting issues in arrow2csv
9c3a865 <ptaylor> ensure byteLength is calculated for offsets buffer
ba755ad <ptaylor> use 53 bit hash fn to further avoid collisions
8659323 <ptaylor> add test for builder iterable byte queueing strategy
13b51db <ptaylor> print more details about each message
f250845 <ptaylor> use a better default dictionary builder hash function
3adf555 <ptaylor> adds or updates most of the high-level Vector.from() methods to use the Vector Builders
5aea4f9 <ptaylor> Add more specific Int64 and Uint64 Builder tests
e974444 <ptaylor> fix lint
5d4b0be <ptaylor> ensure bitmapbufferbuilder increments and decrements _popCount appropriately
4b5375e <ptaylor> remove unnecessary jsdoc
b2556b7 <ptaylor> add docstring for _InternalEmptyPlaceholderRecordBatch
5b990c3 <ptaylor> Clean up typedoc output, update typedoc to master to use typescript@3.4.5
c2673ac <ptaylor> add initial builder jsdoc
893c74f <ptaylor> ensure ListBuilder supports random insertion
dddba1a <ptaylor> ensure variablewidthbuilder supports random insertion
84beb68 <ptaylor> remove some property getters, clean up
0209e9d <ptaylor> update to typescript 3.5
c587c33 <ptaylor> add BufferBuilders, clean up public Builder API
86a6cd0 <ptaylor> update closure compiler dependency
d032e2d <ptaylor> update dependencies
fd46163 <ptaylor> fix nan checks
26387c7 <ptaylor> update typescript, finish streaming builders, add comprehensive builder tests
79bfd46 <ptaylor> fix node and dom builder streams
2b7a63c <ptaylor> update test types
0990344 <ptaylor> add Builder throughDOM and throughNode transform streams
15ac17c <ptaylor> use Object.setPrototypeOf to improve bn performance
f708190 <ptaylor> use safe BigInt64Array constructor
f651657 <ptaylor> update typescript, ts-jest, jest
a307273 <ptaylor> fix readable-stream detection in firefox
4c2ef42 <ptaylor> add the rest of the builder types
9ba9965 <ptaylor> update row type inference sanity check
f8760cb <ptaylor> enumerate each type key individually now that they're a real thing
70377a1 <ptaylor> add vectorname getter to chunked for completion
d193141 <ptaylor> add helper method to return the stride for a datatype
385fee3 <ptaylor> fix typo
2f36741 <ptaylor> move stream methods to io folder
fb2c9a2 <ptaylor> cleanup
1398ebd <ptaylor> don't clone Dictionary DataType instances in Schema assign to preserve pointer to original instance
f7fe8c1 <ptaylor> update builder buffer padding
2b7b992 <ptaylor> ensure union typeids buffer is an Int8Array
6e64c40 <ptaylor> ensure builder allocates aligned buffers, null bitmaps initialized to null, add Int64 builder tests
cf5f778 <ptaylor> return signed or unsigned 53bit integers
8a5c713 <ptaylor> show a better error message if gulp fails
6b7a20e <ptaylor> ensure 64-bit int builders use BN to check for nulls
c609266 <ptaylor> fix bool builder, add primitive builder tests
2f423ec <ptaylor> fix date builder tests
6ebaa36 <ptaylor> WIP streaming data builders
9e5e3ea <ptaylor> fix bool builder, add primitive builder tests
ff32cbb <ptaylor> fix date builder tests
738fabb <ptaylor> WIP streaming data builders

---
Re: #3871, ARROW-2119, and closes ARROW-5396.
This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold:

1. If a RecordBatchReader's source stream contains a Schema but no RecordBatches, the reader yields a single zero-length placeholder RecordBatch (an `_InternalEmptyPlaceholderRecordBatch`) that carries the Schema.
2. The RecordBatchWriter recognizes these zero-length placeholder RecordBatches and skips writing them, so only the Schema (followed by the end-of-stream marker) is serialized.
This is necessary because the reader and writer don't know about each other when they're communicating via the Node and DOM stream I/O primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and RecordBatchWriter don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that carries the Schema instead.
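Here is a sketch of that scenario with the Node stream primitives, assuming the Node build of apache-arrow where the `throughNode` statics are wired up (and `Table.empty()` as in the era's API): a schema-only stream piped reader-to-writer still produces a valid, schema-only stream on the other end.

```typescript
import { Table, RecordBatchReader, RecordBatchStreamWriter } from 'apache-arrow';

// serialize a schema-only table: a Schema message plus EOS, no RecordBatches
RecordBatchStreamWriter.writeAll(Table.empty())
  // the reader transform only sees values pushed through the stream, so it
  // pumps a zero-length placeholder RecordBatch to hand the Schema downstream
  .pipe(RecordBatchReader.throughNode())
  // the writer transform skips the zero-length placeholder when serializing
  .pipe(RecordBatchStreamWriter.throughNode())
  .pipe(process.stdout);
```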