Contributor

@trxcllnt trxcllnt commented Jun 5, 2019

This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

Builder API

The new Builder class exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. The Builder class also supports specifying a list of null-value sentinels, or values that will be interpreted to indicate "null" should be written to the null bitmap instead of being written as a valid element.

Similar to the existing Vector API, Builder has a static Builder.new() method that will return the correct Builder subclass instance for the supplied DataType. Since the Builder constructor takes an options Object, this method also takes an Object:

import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
    type: new Utf8(),
    nullValues: [null, 'n/a']
});

utf8Builder
    .append('hello')
    .append('n/a') // will be interpreted to mean `null`
    .append('world')
    .append(null);

const utf8Vector = utf8Builder.finish().toVector();

console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]

The Builder class has two methods for flushing the pending values to their underlying ArrayBuffer representations: flush(): Data<T> and toVector(): Vector<T> (toVector() calls flush() and creates a Vector instance from the returned Data instance).

Calling Builder.prototype.finish() will finalize a Builder instance. After this, no more values should be written to the Builder instance. This is a no-op for most types, except the DictionaryBuilder, which flushes its internal dictionary and writes the values to the Dictionary type's dictionaryVector field.
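
To make the null-sentinel and flush semantics above concrete, here is a minimal, self-contained sketch. It is not the apache-arrow implementation: ToyUtf8Builder and ToyChunk are hypothetical names, and the toy stores values in plain JS arrays instead of typed buffers. The one detail taken from the Arrow format is the validity bitmap layout, where bit i (least-significant bit first) is set when element i is valid.

```typescript
// Hypothetical shape of one flushed chunk (not Arrow's Data class).
interface ToyChunk {
  length: number;
  nullCount: number;
  nullBitmap: Uint8Array; // bit i is 1 when element i is valid
  values: string[];
}

type Utf8Input = string | null | undefined;

class ToyUtf8Builder {
  private nullValues: ReadonlyArray<Utf8Input>;
  private values: string[] = [];
  private valid: boolean[] = [];

  constructor(options: { nullValues: ReadonlyArray<Utf8Input> }) {
    this.nullValues = options.nullValues;
  }

  // Append one value. Sentinels (and, in this toy, any non-string input)
  // are recorded as null, with an empty-string placeholder in the value slot.
  append(value: Utf8Input): this {
    const isValid = typeof value === 'string' && this.nullValues.indexOf(value) === -1;
    this.valid.push(isValid);
    this.values.push(isValid ? (value as string) : '');
    return this;
  }

  // Pack the pending validity flags into a bitmap, return the chunk,
  // and reset so later appends start a new chunk.
  flush(): ToyChunk {
    const length = this.values.length;
    const nullBitmap = new Uint8Array(Math.ceil(length / 8));
    let nullCount = 0;
    this.valid.forEach((isValid, i) => {
      if (isValid) {
        nullBitmap[i >> 3] |= 1 << (i & 7);
      } else {
        nullCount++;
      }
    });
    const chunk: ToyChunk = { length, nullCount, nullBitmap, values: this.values };
    this.values = [];
    this.valid = [];
    return chunk;
  }
}

const toyBuilder = new ToyUtf8Builder({ nullValues: [null, 'n/a'] });
toyBuilder.append('hello').append('n/a').append('world').append(null);
const toyChunk = toyBuilder.flush();
// toyChunk.nullCount is 2 and toyChunk.nullBitmap[0] is 0b0101 (elements 0 and 2 are valid)
```

Keeping the sentinel check inside append() means callers never have to pre-clean their input, which is the convenience the nullValues option is meant to provide.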

Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)

Iterables and AsyncIterables

The static throughIterable and throughAsyncIterable methods take an options argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

import { Chunked, Builder, Utf8 } from 'apache-arrow';
const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));

The options argument can also specify a queueingStrategy and highWaterMark that control the chunking semantics:

  • If the queueingStrategy is "count" (or is omitted), then the returned generator function will flush the Builder and yield a chunk once the number of values that have been written to the Builder reaches the value supplied for highWaterMark, regardless of how much memory the Builder has allocated.
  • If the queueingStrategy is "bytes", then the returned generator function will flush the Builder and yield a new chunk once the Builder's byteLength field reaches or exceeds the value supplied for highWaterMark, regardless of how many elements the Builder contains.
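
The two strategies above can be sketched with a plain generator. This is an illustration under stated assumptions, not the library's code: chunkThrough and ToyChunkOptions are hypothetical names, chunks are plain string arrays rather than Vectors, and the "bytes" size is approximated as one byte per string character (exact only for ASCII, and ignoring the offsets and null bitmap a real Builder would count). Yielding a final partial chunk at the end of the input is also an assumption.

```typescript
type ToyQueueingStrategy = 'count' | 'bytes';

interface ToyChunkOptions {
  queueingStrategy?: ToyQueueingStrategy; // defaults to 'count'
  highWaterMark: number;
}

// Hypothetical helper: returns a generator function that groups the input
// into chunks, starting a new chunk each time the running size reaches
// highWaterMark.
function chunkThrough(options: ToyChunkOptions) {
  const strategy: ToyQueueingStrategy = options.queueingStrategy || 'count';
  return function* (source: Iterable<string>): IterableIterator<string[]> {
    let pending: string[] = [];
    let size = 0;
    for (const value of source) {
      pending.push(value);
      // 'count' adds 1 per value; 'bytes' adds the approximate byte size.
      size += strategy === 'bytes' ? value.length : 1;
      if (size >= options.highWaterMark) {
        yield pending;
        pending = [];
        size = 0;
      }
    }
    // Assumption: emit whatever is left once the source is exhausted.
    if (pending.length > 0) {
      yield pending;
    }
  };
}

const toyInput = ['a', 'bb', 'ccc', 'dddd', 'e'];

const byCount = Array.from(chunkThrough({ highWaterMark: 2 })(toyInput));
// [['a', 'bb'], ['ccc', 'dddd'], ['e']]

const byBytes = Array.from(
  chunkThrough({ queueingStrategy: 'bytes', highWaterMark: 4 })(toyInput)
);
// [['a', 'bb', 'ccc'], ['dddd'], ['e']]
```

With "count" every full chunk has the same number of elements; with "bytes" the element count varies while the memory per chunk stays roughly bounded.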

Node and DOM Streams

In addition to the Iterable transform APIs, we can also create node and DOM transform streams with similar options:

import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';
const options = {
  type: new Utf8(),
  nullValues: [undefined, 'n/a'],
  queueingStrategy: 'bytes',
  highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
  Readable
    .from(['hello', 'n/a', 'world', undefined])
    .pipe(Builder.throughNode(options))
));

Miscellaneous

@trxcllnt trxcllnt requested a review from TheNeuralBit June 5, 2019 04:43

codecov-io commented Jun 5, 2019

Codecov Report

Merging #4476 into master will increase coverage by 1.73%.
The diff coverage is 86.52%.

@@            Coverage Diff             @@
##           master    #4476      +/-   ##
==========================================
+ Coverage   88.28%   90.02%   +1.73%     
==========================================
  Files         846      101     -745     
  Lines      103662     6404   -97258     
  Branches     1253     1418     +165     
==========================================
- Hits        91519     5765   -85754     
+ Misses      11896      629   -11267     
+ Partials      247       10     -237
Impacted Files                     Coverage Δ
js/src/visitor/vectorloader.ts     98.75% <ø> (ø) ⬆️
js/src/vector/decimal.ts           100% <ø> (ø) ⬆️
js/src/vector/binary.ts            66.66% <ø> (ø) ⬆️
js/src/io/node/writer.ts           100% <ø> (ø)
js/src/util/compat.ts              65.78% <ø> (-0.88%) ⬇️
js/src/visitor/vectorctor.ts       100% <ø> (+10.93%) ⬆️
js/src/vector/map.ts               77.77% <ø> (ø) ⬆️
js/src/visitor/typecomparator.ts   53.78% <ø> (ø) ⬆️
js/src/compute/dataframe.ts        92.15% <ø> (ø) ⬆️
js/src/vector.ts                   100% <ø> (ø) ⬆️
... and 892 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a00cf6...7998d2a.

Member

wesm commented Jun 7, 2019

This is a pretty epic piece of work, @TheNeuralBit @domoritz I assume this is on your radar to review?

Member

@domoritz domoritz left a comment

This is way too much code for me to review right now, unfortunately.

js/package.json Outdated
Member

At some point, we should replace tslint with eslint since the former is being deprecated.

Contributor Author

yes I imagine at some point we should switch over.

Member

@TheNeuralBit TheNeuralBit left a comment

I'm in the same boat as Dom, not sure I can devote enough time to review this thoroughly. Maybe we should schedule some time to just walk through it together.

js/package.json Outdated
Member

Is this just tying us to master? Is there something we need that hasn't been released?

Contributor Author

Yes unfortunately :/. Each TypeDoc release is pinned to a specific version of the TypeScript compiler, and as of this commit, the latest public release on npm is on TS 3.2.4, but master is on 3.4.5. We should update this dependency when TypeDoc publishes master to npm.

Member

Isn't this a no-op?

Contributor Author

Ah yes it is. I had this loop in while I was debugging an edge case and forgot to remove it once it was solved.

js/package.json Outdated
Member

It would be nice if changes like this TS upgrade were on a separate branch/PR in the future. 10k added lines is a lot to review and the changes that were necessary for this upgrade (as opposed to the Builders) will be lost in the noise.

Contributor Author

The changes to update to TS@3.5.1 were relatively minor changes to the type mapping interfaces, and are all in this commit: trxcllnt@492d1f1#diff-3cc15e0c860bac3718152b2c143638ad

I don't disagree on preferring this to be split out and handled separately, but practically speaking I was working under a number of constraints that made it more difficult. First, TypeScript released v3.3+ with breaking changes to type resolution halfway through the process of working on these features (leading to this issue).

Second, I was using the new Builders in app code as they were being developed. The apps needed TS 3.5, but wouldn't compile without these fixes, so I had to update Arrow. But the Builders branch was sufficiently different that it would have been difficult to track and constantly rebase with that change (and it wasn't even done yet), so it was easier to do it here.

Contributor Author

trxcllnt commented Jun 8, 2019

@TheNeuralBit that sounds good to me, I'm happy to schedule some time to do a walk through whenever you're free.

I tried to model the Builders similar to the Vectors, both organizationally and in their public API. Most of the additions live in the src/builders directory, or the src/Builder.ts base class, which I tried to annotate with docstrings as I went. I also followed a similar pattern to the Arrow C++ Builders, and centralized the low-level buffer operations into a few BufferBuilder classes here.

Creating and using Builders is also very similar to Vectors: Builder.new({ type: DataType }) returns a new Builder for the given type, and you can set() or append() any of the same types that are accepted or returned by the corresponding Vector.set() and Vector.get() calls, respectively. The high-level streaming APIs also accept the same options as Builder.new() to help make the transition from an imperative building routine to a streaming routine as easy as possible.

And lastly, I added a basic implementation for Vector.from() for all Vectors that use the corresponding Builder implementation to create a Vector from JS values for simple use-cases (and Table/RecordBatch that use the Struct or Map Builders under the hood). For example, now you can create a Table from JS objects rather easily:

const { Table, Field, Map_, Int32, Utf8 } = require('apache-arrow');
const table = Table.from({
  type: new Map_([new Field('int', new Int32()), new Field('str', new Utf8())]),
  values: [
    { 'int': 0, str: 'hello' },
    { 'int': 1, str: 'world' }
  ]
});
console.table([...table].map((row) => row.toJSON()));

Also, check out the small csv-to-arrow-js demo for an example of using the AsyncIterable streaming API. I'm working this weekend on more docs and examples like this that we can consolidate into a better set of examples in the repo.

@bclinkinbeard

Just a note that my team is looking forward to this feature being available. Our backend serves binary protocol buffers with custom bit packing strategies, so we do a fair amount of dynamic Arrow table construction. I expect Builders will allow us to remove some pretty repetitive code.

Member

wesm commented Jun 18, 2019

@trxcllnt @TheNeuralBit think we can get this merged this week? Let me know if there's something I can do to help

@trxcllnt
Contributor Author

@wesm yes, I'd love to get this (and the delta dictionaries PR #4502) merged this week for the 0.14 release. I've been using them in side projects and they're pretty stable, I think we're just blocked on a review here. @TheNeuralBit let me know if you have time to schedule a walkthrough some time this week/weekend, or @wesm I'd also be happy to do one with you.

@TheNeuralBit
Member

Paul and I are meeting up today to go over this. Merging this week seems reasonable.

Member

@TheNeuralBit TheNeuralBit left a comment

Thanks for walking me through this, and thanks for implementing it. It's a great addition and I think it will make it a lot easier for people to start using arrow JS.

Just a couple of minor comments based on our discussion.

@TheNeuralBit
Member

Thanks! I'll merge when CI passes

TheNeuralBit pushed a commit that referenced this pull request Jun 22, 2019
…DictionaryBuilder

Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to #4316.

Forked from  #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842

Author: ptaylor <paul.e.taylor@me.com>

Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits:

6a70a25 <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
kou pushed a commit to apache/arrow-js that referenced this pull request May 14, 2025
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
Development

Successfully merging this pull request may close these issues.

Arrow TS Excessively Deep Type Instantiations

7 participants