

trxcllnt commented May 22, 2019

Re: #3871, ARROW-2119, and closes ARROW-5396.

This PR updates the JS Readers and Writers to support files and streams with no RecordBatches. The approach here is two-fold:

  1. If the Readers' source message stream terminates after reading the Schema message, the Reader will yield a dummy zero-length RecordBatch with the schema.
  2. The Writer always writes the schema for any RecordBatch, but skips writing the RecordBatch field metadata if it's empty.

This is necessary because the reader and writer don't know about each other when they're communicating via the Node and DOM stream I/O primitives; they only know about the values pushed through the streams. Since the RecordBatchReader and Writer don't yield the Schema message as a standalone value, we pump the stream with a zero-length RecordBatch that contains the schema instead.
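The placeholder approach described above can be sketched roughly like this (the names below are illustrative only, not the actual arrow JS internals):

```typescript
// Hypothetical sketch: a reader whose source terminates after the Schema
// message yields one zero-length RecordBatch so downstream consumers still
// receive the schema through the stream.
interface Schema { fields: string[]; }
interface RecordBatch { schema: Schema; length: number; }

function* readBatches(schema: Schema, batches: RecordBatch[]): Generator<RecordBatch> {
    let yielded = false;
    for (const batch of batches) {
        yielded = true;
        yield batch;
    }
    if (!yielded) {
        // Dummy zero-length batch: carries the schema, contains no rows.
        yield { schema, length: 0 };
    }
}

const schema: Schema = { fields: ['a', 'b'] };
const out = [...readBatches(schema, [])]; // empty source still yields one batch
```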

trxcllnt requested review from TheNeuralBit and wesm on May 22, 2019 17:55

wesm commented May 22, 2019

@trxcllnt so the dummy RecordBatch is needed only to obtain the schema? I'm curious whether this causes a zero-length RecordBatch to be subsequently sent.

With this change, does this pass the integration test I added?


trxcllnt commented May 22, 2019

@wesm yes it passed the integration test when I ran it locally.

The change in this PR causes the RecordBatchStreamReader to yield a dummy zero-length RecordBatch if the stream is going to terminate without yielding at least one RecordBatch (here), but it also updates the Writer to ignore zero-length RecordBatches (here).

This will cause the Writer not to write any zero-length RecordBatches, but I can't think of a case where that'd be a problem. If it is, we could instead use an internal flag to indicate whether a zero-length RecordBatch should or shouldn't be written :-).


wesm commented May 22, 2019

Interesting. We also have an integration test for length-0 RecordBatch objects, so if this passes both tests then I suppose it's OK. I guess I should merge my no-record-batches patch with only C++ using it, and then you can rebase this and enable JS so we can get a complete build?

trxcllnt commented

Ah, I forgot about that test. I thought you were asking about the no-record-batches integration test from PR #3871, which is what I tested locally. I don't think it'll pass the generate_primitive_case([0, 0, 0], name='primitive_zerolength') test unless we alter the Writer to be more intelligent about which zero-length RecordBatches should and shouldn't be written.

I'll make that change here shortly and update this PR.


trxcllnt commented May 22, 2019

@wesm ok, I updated the Reader and Writer to use a new _InternalEmptyPlaceholderRecordBatch subclass to indicate whether the input stream truly has no batches or just a collection of zero-length batches. Looks like the C++ and JS integration tests now line up.
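The sentinel-subclass idea can be sketched like this (illustrative names; the real class is `_InternalEmptyPlaceholderRecordBatch` in the JS sources):

```typescript
// Hypothetical sketch: a placeholder subclass lets the writer distinguish a
// stream that truly has no batches from a genuine zero-length batch.
class RecordBatch {
    constructor(public readonly length: number) {}
}

// Placeholder used only to carry the schema through an otherwise-empty
// stream; the writer drops it but keeps genuine zero-length batches.
class EmptyPlaceholderRecordBatch extends RecordBatch {
    constructor() { super(0); }
}

function shouldWrite(batch: RecordBatch): boolean {
    return !(batch instanceof EmptyPlaceholderRecordBatch);
}

const real = new RecordBatch(0);                    // genuine zero-length batch
const placeholder = new EmptyPlaceholderRecordBatch();
```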

wesm force-pushed the js/fix-no-record-batches branch from 622c875 to c860696 on May 23, 2019 17:19

wesm commented May 23, 2019

Rebased and enabled the no-batches integration test for JS

codecov-io commented

Codecov Report

Merging #4373 into master will increase coverage by 1.99%.
The diff coverage is 64.7%.


```diff
@@            Coverage Diff             @@
##           master    #4373      +/-   ##
==========================================
+ Coverage   88.34%   90.33%   +1.99%
==========================================
  Files         781       74     -707
  Lines       98721     5464   -93257
  Branches     1251     1253       +2
==========================================
- Hits        87212     4936   -82276
+ Misses      11273      520   -10753
+ Partials      236        8     -228
```
| Impacted Files | Coverage Δ |
| --- | --- |
| js/src/ipc/writer.ts | 89.04% <100%> (+0.05%) ⬆️ |
| js/src/ipc/message.ts | 92.25% <100%> (-0.1%) ⬇️ |
| js/src/recordbatch.ts | 90.47% <33.33%> (-4.4%) ⬇️ |
| js/src/ipc/reader.ts | 87.22% <42.85%> (-0.92%) ⬇️ |
| python/pyarrow/ipc.pxi | |
| cpp/src/arrow/csv/chunker-test.cc | |
| cpp/src/parquet/column_page.h | |
| cpp/src/parquet/bloom_filter-test.cc | |
| cpp/src/arrow/array/builder_decimal.cc | |
| cpp/src/plasma/client.cc | |
| ... and 703 more | |


emkornfield commented

@TheNeuralBit do you mind reviewing the JS?

TheNeuralBit commented

Would you mind adding a docstring on _InternalEmptyPlaceholderRecordBatch to make it clear why it exists? Otherwise LGTM.
Some non-integration test cases would be nice too, but we could just make a JIRA for that.


wesm commented May 30, 2019

@trxcllnt if you can add that comment I can go ahead and merge this.


wesm commented May 31, 2019

I'm going ahead and merging this. Please feel free to submit a follow-up PR to document this.

wesm closed this in 17e0198 on May 31, 2019
TheNeuralBit pushed a commit that referenced this pull request Jun 19, 2019
This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

#### Builder API

The new `Builder` class exposes an API for sequentially appending arbitrary JavaScript values (or setting them into slots that have already been allocated), all of which are flushed to the same underlying Data chunk. The `Builder` class also supports specifying a list of null-value sentinels: values that are interpreted as null, so a null is recorded in the null bitmap instead of a valid element being written.

Similar to the existing `Vector` API, `Builder` has a static `Builder.new()` method that will return the correct `Builder` subclass instance for the supplied DataType. Since the `Builder` constructor takes an options Object, this method also takes an Object:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
    type: new Utf8(),
    nullValues: [null, 'n/a']
});

utf8Builder
    .append('hello')
    .append('n/a') // will be interpreted to mean `null`
    .append('world')
    .append(null);

const utf8Vector = utf8Builder.finish().toVector();

console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]
```

The `Builder` class has two methods for flushing the pending values to their underlying ArrayBuffer representations: `flush(): Data<T>` and `toVector(): Vector<T>` (`toVector()` calls `flush()` and creates a `Vector` instance from the returned Data instance).

Calling `Builder.prototype.finish()` will finalize a `Builder` instance; after this, no more values should be written to the Builder instance. This is a no-op for most types, except the `DictionaryBuilder`, which flushes its internal dictionary and writes the values to the `Dictionary` type's `dictionaryVector` field.
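The relationship between `flush()` and `toVector()` can be illustrated with a toy builder (illustrative names only; the real methods return `Data<T>` and `Vector<T>`):

```typescript
// Toy sketch: flush() drains the pending values into an immutable chunk;
// toVector() simply calls flush() and wraps the result in a container.
class ToyBuilder {
    private pending: string[] = [];
    append(value: string): this {
        this.pending.push(value);
        return this;
    }
    flush(): string[] {              // stand-in for flush(): Data<T>
        const data = this.pending;
        this.pending = [];           // subsequent values start a new chunk
        return data;
    }
    toVector(): { data: string[] } { // stand-in for toVector(): Vector<T>
        return { data: this.flush() };
    }
}

const b = new ToyBuilder().append('x').append('y');
const vec = b.toVector(); // drains both pending values into one chunk
```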

#### Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

```typescript
Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)
```

#### Iterables and AsyncIterables
The static `throughIterable` and `throughAsyncIterable` methods take an `options` argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

```typescript
import { Chunked, Builder, Utf8 } from 'apache-arrow';
const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));
```

The `options` argument can also specify a `queueingStrategy` and `highWaterMark` that control the chunking semantics:
* If the `queueingStrategy` is `"count"` (or is omitted), then the returned generator function will flush the `Builder` and yield a chunk once the number of values that have been written to the Builder reaches the value supplied for `highWaterMark`, regardless of how much memory the `Builder` has allocated.
* If the `queueingStrategy` is `"bytes"`, then the returned generator function will flush the `Builder` and yield a new chunk once the Builder's `byteLength` field reaches or exceeds the value supplied for `highWaterMark`, regardless of how many elements the `Builder` contains.
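The two strategies above can be sketched with a toy chunking generator (illustrative names; string length stands in for the Builder's `byteLength`):

```typescript
// Hypothetical sketch of "count" vs "bytes" chunking against a highWaterMark.
type QueueingStrategy = 'count' | 'bytes';

function* chunkValues(
    values: string[],
    highWaterMark: number,
    queueingStrategy: QueueingStrategy = 'count'
): Generator<string[]> {
    let chunk: string[] = [];
    let byteLength = 0;
    for (const value of values) {
        chunk.push(value);
        byteLength += value.length; // stand-in for the Builder's byteLength
        const full = queueingStrategy === 'count'
            ? chunk.length >= highWaterMark   // flush every N values
            : byteLength >= highWaterMark;    // flush every N "bytes"
        if (full) {
            yield chunk;
            chunk = [];
            byteLength = 0;
        }
    }
    if (chunk.length > 0) { yield chunk; }    // final partial chunk
}

const byCount = [...chunkValues(['a', 'b', 'c', 'd', 'e'], 2, 'count')];
// → [['a', 'b'], ['c', 'd'], ['e']]
const byBytes = [...chunkValues(['aa', 'b', 'cccc'], 3, 'bytes')];
// → [['aa', 'b'], ['cccc']]
```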

#### Node and DOM Streams

In addition to the Iterable transform APIs, we can also create node and DOM transform streams with similar options:
```typescript
import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';
const options = {
  type: new Utf8(),
  nullValues: [undefined, 'n/a'],
  queueingStrategy: 'bytes',
  highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
  Readable
    .from(['hello', 'n/a', 'world', undefined])
    .pipe(Builder.throughNode(options))
));
```

#### Miscellaneous
* Updates most dependencies, updates TypeScript to v3.5.1 (and resolves #4452)
* Updates the BigInt compatibility type to use `Object.setPrototypeOf()`, yielding a 4x speedup
* Updates Int64 and Uint64 set routines to accept native `bigint` types if available
* Adds a docstring to the `_InternalEmptyPlaceholderRecordBatch` class added in #4373

Author: ptaylor <paul.e.taylor@me.com>

Closes #4476 from trxcllnt/js/data-builders and squashes the following commits:

7998d2a <ptaylor> add createIsValidFunction example docstring
07fa443 <ptaylor> remove default dictionary hash function
8b0752f <ptaylor> fix possible AsyncRandomAccessFile race condition retrieving filehandle size
4d2a4f0 <ptaylor> regenerate flatbuffer source from current format schemas
05acad8 <ptaylor> fix minor row serialization issues to be compatible with console.table()
c0f7f7b <ptaylor> fix a few minor formatting issues in arrow2csv
9c3a865 <ptaylor> ensure byteLength is calculated for offsets buffer
ba755ad <ptaylor> use 53 bit hash fn to further avoid collisions
8659323 <ptaylor> add test for builder iterable byte queueing strategy
13b51db <ptaylor> print more details about each message
f250845 <ptaylor> use a better default dictionary builder hash function
3adf555 <ptaylor> adds or updates most of the high-level Vector.from() methods to use the Vector Builders
5aea4f9 <ptaylor> Add more specific Int64 and Uint64 Builder tests
e974444 <ptaylor> fix lint
5d4b0be <ptaylor> ensure bitmapbufferbuilder increments and decrements _popCount appropriately
4b5375e <ptaylor> remove unnecessary jsdoc
b2556b7 <ptaylor> add docstring for _InternalEmptyPlaceholderRecordBatch
5b990c3 <ptaylor> Clean up typedoc output, update typedoc to master to use typescript@3.4.5
c2673ac <ptaylor> add initial builder jsdoc
893c74f <ptaylor> ensure ListBuilder supports random insertion
dddba1a <ptaylor> ensure variablewidthbuilder supports random insertion
84beb68 <ptaylor> remove some property getters, clean up
0209e9d <ptaylor> update to typescript 3.5
c587c33 <ptaylor> add BufferBuilders, clean up public Builder API
86a6cd0 <ptaylor> update closure compiler dependency
d032e2d <ptaylor> update dependencies
fd46163 <ptaylor> fix nan checks
26387c7 <ptaylor> update typescript, finish streaming builders, add comprehensive builder tests
79bfd46 <ptaylor> fix node and dom builder streams
2b7a63c <ptaylor> update test types
0990344 <ptaylor> add Builder throughDOM and throughNode transform streams
15ac17c <ptaylor> use Object.setPrototypeOf to improve bn performance
f708190 <ptaylor> use safe BigInt64Array constructor
f651657 <ptaylor> update typescript, ts-jest, jest
a307273 <ptaylor> fix readable-stream detection in firefox
4c2ef42 <ptaylor> add the rest of the builder types
9ba9965 <ptaylor> update row type inference sanity check
f8760cb <ptaylor> enumerate each type key individually now that they're a real thing
70377a1 <ptaylor> add vectorname getter to chunked for completion
d193141 <ptaylor> add helper method to return the stride for a datatype
385fee3 <ptaylor> fix typo
2f36741 <ptaylor> move stream methods to io folder
fb2c9a2 <ptaylor> cleanup
1398ebd <ptaylor> don't clone Dictionary DataType instances in Schema assign to preserve pointer to original instance
f7fe8c1 <ptaylor> update builder buffer padding
2b7b992 <ptaylor> ensure union typeids buffer is an Int8Array
6e64c40 <ptaylor> ensure builder allocates aligned buffers, null bitmaps initialized to null, add Int64 builder tests
cf5f778 <ptaylor> return signed or unsigned 53bit integers
8a5c713 <ptaylor> show a better error message if gulp fails
6b7a20e <ptaylor> ensure 64-bit int builders use BN to check for nulls
c609266 <ptaylor> fix bool builder, add primitive builder tests
2f423ec <ptaylor> fix date builder tests
6ebaa36 <ptaylor> WIP streaming data builders
9e5e3ea <ptaylor> fix bool builder, add primitive builder tests
ff32cbb <ptaylor> fix date builder tests
738fabb <ptaylor> WIP streaming data builders
