ARROW-3336: [JS] Fix IPC writer serializing sliced arrays#2638
ARROW-3336: [JS] Fix IPC writer serializing sliced arrays#2638trxcllnt wants to merge 6 commits intoapache:masterfrom
Conversation
|
looks like JS is passing, but the R build has failed |
|
@TheNeuralBit also commit 49c601c makes the Table constructor arg handling significantly less confusing |
…ble-width vectors, as all other buffers have already been adjusted as part of the original call to Vector#slice()
e473e73 to
bcd58ca
Compare
| this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets)); | ||
| } | ||
| // Then insert the List's values child | ||
| return this.visit((vector as any as ListVector<T>).getChildAt(0)!); |
There was a problem hiding this comment.
What went wrong here, exactly? Is the offset being handled elsewhere?
There was a problem hiding this comment.
There was a problem hiding this comment.
@wesm Brian is right, most of the internal buffers are zero-copy sliced up front in the ArrayData internals. This is mostly a performance optimization/complexity reduction, since it allows us to index into those TypedArrays starting from zero in most cases. Notable exceptions are when the indexing is dependent on other buffers (e.g. we do slice the valueOffsets buffer of a Utf8Vector, but not the data buffer), or when multiple values are packed into < 1 byte like the validity bitmap or BoolVector's data buffer. So we still need to track the logical offset internally, but only use it in those special cases.
There was a problem hiding this comment.
@wesm if you want to see a simplified version of the internal buffer management/slicing that's going on here, I have one here in my unfinished js-data-refactor branch that I'm reeeally looking forward to getting time to finish soon.
There was a problem hiding this comment.
Sounds good, looking forward to getting that in :)
TheNeuralBit
left a comment
There was a problem hiding this comment.
Couple of minor comments, but this looks good. Thanks for adding tests to cover the original problem :)
| (expect([v1, `batch`, name]) as any).toEqualVector([v0, `source`]); | ||
| (expect([v2, `table`, name]) as any).toEqualVector([v0, `source`]); | ||
| } | ||
| } |
There was a problem hiding this comment.
What do you think about pulling some of this logic out into a toEqualTable function? I think that would be generally useful, especially if we want to start doing some round-trip tests now that we have a writer.
There was a problem hiding this comment.
Yeah totally, we should add more helpers to the jest-extensions file. We can also clean up the tests to take advantage of TS's new type X = import('some-module').X feature, to import just type definitions and not the full implementation. I think that would simplify the dynamic import logic in test/Arrow.ts and hopefully make it impossible to accidentally do this.
| // If we have valueOffsets (ListVector), push that buffer first | ||
| if (valueOffsets) { | ||
| this.addBuffer(this.getZeroBasedValueOffsets(offset, length, valueOffsets)); | ||
| this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets)); |
There was a problem hiding this comment.
It looks like getZeroBasedValueOffsets is always called with a hard-coded 0 now, which makes it a no-op. Should we just get rid of it?
There was a problem hiding this comment.
@TheNeuralBit you're right that we subarray the valueOffsets buffer when we slice, but we still need to call getZeroBasedValueOffsets() in all these places because it also does work in the case that valueOffsets[0] != 0. That will be true if we performed any meaningful valueOffsets.subarray() slice in the data's sliceInternal() method, in which case we need to adjust all the values in the valueOffsets Array on the way out.
| this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets)); | ||
| } | ||
| // Then insert the List's values child | ||
| return this.visit((vector as any as ListVector<T>).getChildAt(0)!); |
There was a problem hiding this comment.
|
No more comments from me, let me know if you want me to merge this or anything else |
|
@wesm I'm happy to do the merge also, if for no other reason than to get my PR merge stats up :-) |
|
OK, I'll leave it to you both |
I just noticed this commit wasn't included in #2638 :-[. My bad, here's the rest of the fix for ARROW-3336 Author: ptaylor <paul.e.taylor@me.com> Closes #2742 from trxcllnt/js-fix-writing-sliced-arrays and squashes the following commits: 3ab9da0 <ptaylor> fix serializing sliced Utf8 vectors
I just noticed this commit wasn't included in apache/arrow#2638 :-[. My bad, here's the rest of the fix for ARROW-3336 Author: ptaylor <paul.e.taylor@me.com> Closes #2742 from trxcllnt/js-fix-writing-sliced-arrays and squashes the following commits: 3ab9da080 <ptaylor> fix serializing sliced Utf8 vectors

Fixes https://issues.apache.org/jira/browse/ARROW-3336. Realized too late I branched from the #2616, so that's why there's some extra commits at the front.
Check out the relevant tests here: https://github.com/trxcllnt/arrow/blob/860f61046fca0081dbe0ce986a97408ed5934e22/js/test/unit/writer-tests.ts#L26. This test covers the primitive, nested, and dictionary vectors, but it'd be worth adding more (for example, Unions).