-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5115: [JS] Add Vector Builders and high-level stream primitives #4476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4476 +/- ##
==========================================
+ Coverage 88.28% 90.02% +1.73%
==========================================
Files 846 101 -745
Lines 103662 6404 -97258
Branches 1253 1418 +165
==========================================
- Hits 91519 5765 -85754
+ Misses 11896 629 -11267
+ Partials 247 10 -237
Continue to review full report at Codecov.
|
ea8dcee to
065c826
Compare
|
This is a pretty epic piece of work, @TheNeuralBit @domoritz I assume this is on your radar to review? |
domoritz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is way too much code for me to review right now, unfortunately.
js/package.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some pint, we should replace tslint with slint since the former is being deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I imagine at some point we should switch over.
TheNeuralBit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in the same boat as Dom, not sure I can devote enough time to review this thoroughly. Maybe we should schedule some time to just walk through it together.
js/package.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just tying us to master? Is there something we need that hasn't been released?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes unfortunately :/. Each TypeDoc release is pinned to a specific version of the TypeScript compiler, and as of this commit, the latest public release on npm is on TS 3.2.4, but master is on 3.4.5. We should update this dependency when TypeDoc publishes master to npm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes it is. I had this loop in while I was debugging an edge case and forgot to remove it once it was solved.
js/package.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if changes like this TS upgrade were on a separate branch/PR in the future. 10k added lines is a lot to review and the changes that were necessary for this upgrade (as opposed to the Builders) will be lost in the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to update to TS@3.5.1 were relatively minor changes to the type mapping interfaces, and are all in this commit: trxcllnt@492d1f1#diff-3cc15e0c860bac3718152b2c143638ad
I don't disagree on preferring this to be split out and handled separately, but practically speaking I was working under a number of constraints that made it more difficult. First, TypeScript released v3.3+ with breaking changes to type resolution halfway through the process of working on these features (leading to this issue).
Second, I was using the new Builders in app code as they were being developed. The apps needed TS 3.5, but wouldn't compile without these fixes, so I had to update Arrow. But the Builders branch was sufficiently different and would have been difficult to track and constantly rebase with that change (and wasn't even done yet), it was easier to do it here.
|
@TheNeuralBit that sounds good to me, I'm happy to schedule some time to do a walk through whenever you're free. I tried to model the Builders similar to the Vectors, both organizationally and in their public API. Most of the additions live in the Creating and using Builders is also very similar to Vectors: And lastly, I added a basic implementation for const { Table, Field, Map_, Int32, Utf8 } = require('apache-arrow');
const table = Table.from({
type: new Map_([new Field('int', new Int32()), new Field('str', new Utf8())]),
values: [
{ 'int': 0, str: 'hello' },
{ 'int': 1, str: 'world' }
]
});
console.table([...table].map((row) => row.toJSON());Also, check out the small |
… null, add Int64 builder tests
…e pointer to original instance
…he Vector Builders
a32df3e to
8b0752f
Compare
|
Just a note that my team is looking forward to this feature being available. Our backend serves binary protocol buffers with custom bit packing strategies, so we do a fair amount of dynamic Arrow table construction. I expect Builders will allow us to remove some pretty repetitive code. |
|
@trxcllnt @TheNeuralBit think we can get this merged this week? Let me know if there's something I can do to help |
|
@wesm yes, I'd love to get this (and the delta dictionaries PR #4502) merged this week for the 0.14 release. I've been using them in side projects and they're pretty stable, I think we're just blocked on a review here. @TheNeuralBit let me know if you have time to schedule a walkthrough some time this week/weekend, or @wesm I'd also be happy to do one with you. |
|
Paul and I are meeting up today to go over this. Merging this week seems reasonable. |
TheNeuralBit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for walking me through this, and thanks for implementing it. It's a great addition and I think it will make it a lot easier for people to start using arrow JS.
Just a couple of minor comments based on our discussion.
|
Thanks! I'll merge when CI passes |
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to #4316. Forked from #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25 <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to apache/arrow#4316. Forked from apache/arrow#4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: apache/arrow@b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25bd <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to apache/arrow#4316. Forked from apache/arrow#4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: apache/arrow@b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25bd <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, node streams, and DOM streams.
edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js
Builder API
The new
Builderclass exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. TheBuilderclass also supports specifying a list of null-value sentinels, or values that will be interpreted to indicate "null" should be written to the null bitmap instead of being written as a valid element.Similar to the existing
VectorAPI,Builderhas a staticBuilder.new()method that will return the correctBuildersubclass instance for the supplied DataType. Since theBuilderconstructor takes an options Object, this method also takes an Object:The
Builderclass has two methods for flushing the pending values to their underlying ArrayBuffer representations:flush(): Data<T>andtoVector(): Vector<T>(toVector()callsflush()and creates aVectorinstance from the returned Data instance).Calling
Builder.prototype.finish()will finalize aBuilderinstance. After this, no more values should be written to the Builder instance. This is a no-op on for most types, except theDictionaryBuilder, which flushes its internal dictionary and writes the values to theDictionarytype'sdictionaryVectorfield.Iterable and stream APIs
Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:
Iterables and AsyncIterables
The static
throughIterableandthroughAsyncIterablemethods take anoptionsargument that indicates the Builder's type and null-value sentinels, and returns a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:The
optionsargument can also specify aqueueingStrategyandhighWaterMarkthat control the chunking semantics:queueingStrategyis"count"(or is omitted), then the returned generator function will flush theBuilderand yield a chunk once the number of values that have been written to the Builder reaches the value supplied forhighWaterMark, regardless of how much memory theBuilderhas allocated.queueingStrategyis"bytes", then the returned generator function will flush theBuilderand yield a new chunk once the Builder'sbyteLengthfield reaches or exceeds the value supplied forhighWaterMark, regardless of how many elements theBuildercontains.Node and DOM Streams
In addition to the Iterable transform APIs, we can also create node and DOM transform streams with similar options:
Miscellaneous
Object.setPrototypeOf(), yielding a 4x speedupbiginttypes if available_InternalEmptyPlaceholderRecordBatchclass added in ARROW-5396: [JS] Support files and streams with no record batches #4373