Skip to content

ARROW-3336: [JS] Fix IPC writer serializing sliced arrays#2638

Closed
trxcllnt wants to merge 6 commits intoapache:masterfrom
trxcllnt:js-fix-writing-sliced-arrays
Closed

ARROW-3336: [JS] Fix IPC writer serializing sliced arrays#2638
trxcllnt wants to merge 6 commits intoapache:masterfrom
trxcllnt:js-fix-writing-sliced-arrays

Conversation

@trxcllnt
Copy link
Contributor

Fixes https://issues.apache.org/jira/browse/ARROW-3336. Realized too late I branched from the #2616, so that's why there's some extra commits at the front.

Check out the relevant tests here: https://github.com/trxcllnt/arrow/blob/860f61046fca0081dbe0ce986a97408ed5934e22/js/test/unit/writer-tests.ts#L26. This test covers the primitive, nested, and dictionary vectors, but it'd be worth adding more (for example, Unions).

@trxcllnt
Copy link
Contributor Author

looks like JS is passing, but the R build has failed

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Sep 27, 2018

@TheNeuralBit also commit 49c601c makes the Table constructor arg handling significantly less confusing

@wesm wesm force-pushed the js-fix-writing-sliced-arrays branch from e473e73 to bcd58ca Compare September 27, 2018 09:49
@wesm
Copy link
Member

wesm commented Sep 27, 2018

Just rebased this after merging #2616, #2635

this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets));
}
// Then insert the List's values child
return this.visit((vector as any as ListVector<T>).getChildAt(0)!);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What went wrong here, exactly? Is the offset being handled elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this offset was intended to handle the buffers for vectors that had been sliced, but that function is already slicing all of the vector's child buffers here, so this is redundant. Note that subarray is zero-copy, it's effectively managing the offset for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm Brian is right, most of the internal buffers are zero-copy sliced up front in the ArrayData internals. This is mostly a performance optimization/complexity reduction, since it allows us to index into those TypedArrays starting from zero in most cases. Notable exceptions are when the indexing is dependent on other buffers (e.g. we do slice the valueOffsets buffer of a Utf8Vector, but not the data buffer), or when multiple values are packed into < 1 byte like the validity bitmap or BoolVector's data buffer. So we still need to track the logical offset internally, but only use it in those special cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm if you want to see a simplified version of the internal buffer management/slicing that's going on here, I have one here in my unfinished js-data-refactor branch that I'm reeeally looking forward to getting time to finish soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, looking forward to getting that in :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, especially since it lets us take full advantage of TypeScript 3's more advanced type inference:

arrow-inference

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor comments, but this looks good. Thanks for adding tests to cover the original problem :)

(expect([v1, `batch`, name]) as any).toEqualVector([v0, `source`]);
(expect([v2, `table`, name]) as any).toEqualVector([v0, `source`]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about pulling some of this logic out into a toEqualTable function? I think that would be generally useful, especially if we want to start doing some round-trip tests now that we have a writer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally, we should add more helpers to the jest-extensions file. We can also clean up the tests to take advantage of TS's new type X = import('some-module').X feature, to import just type definitions and not the full implementation. I think that would simplify the dynamic import logic in test/Arrow.ts and hopefully make it impossible to accidentally do this.

// If we have valueOffsets (ListVector), push that buffer first
if (valueOffsets) {
this.addBuffer(this.getZeroBasedValueOffsets(offset, length, valueOffsets));
this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like getZeroBasedValueOffsets is always called with a hard-coded 0 now, which makes it a no-op. Should we just get rid of it?

Copy link
Contributor Author

@trxcllnt trxcllnt Sep 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheNeuralBit you're right that we subarray the valueOffsets buffer when we slice, but we still need to call getZeroBasedValueOffsets() in all these places because it also does work in the case that valueOffsets[0] != 0. That will be true if we performed any meaningful valueOffsets.subarray() slice in the data's sliceInternal() method, in which case we need to adjust all the values in the valueOffsets Array on the way out.

this.addBuffer(this.getZeroBasedValueOffsets(0, length, valueOffsets));
}
// Then insert the List's values child
return this.visit((vector as any as ListVector<T>).getChildAt(0)!);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this offset was intended to handle the buffers for vectors that had been sliced, but that function is already slicing all of the vector's child buffers here, so this is redundant. Note that subarray is zero-copy, it's effectively managing the offset for us.

@wesm
Copy link
Member

wesm commented Sep 27, 2018

No more comments from me, let me know if you want me to merge this or anything else

@trxcllnt
Copy link
Contributor Author

@wesm I'm happy to do the merge also, if for no other reason than to get my PR merge stats up :-)

@wesm
Copy link
Member

wesm commented Sep 27, 2018

OK, I'll leave it to you both

@trxcllnt trxcllnt closed this in b796b57 Sep 27, 2018
wesm pushed a commit that referenced this pull request Oct 11, 2018
I just noticed this commit wasn't included in #2638 :-[. My bad, here's the rest of the fix for ARROW-3336

Author: ptaylor <paul.e.taylor@me.com>

Closes #2742 from trxcllnt/js-fix-writing-sliced-arrays and squashes the following commits:

3ab9da0 <ptaylor> fix serializing sliced Utf8 vectors
kou pushed a commit to apache/arrow-js that referenced this pull request May 14, 2025
I just noticed this commit wasn't included in apache/arrow#2638 :-[. My bad, here's the rest of the fix for ARROW-3336

Author: ptaylor <paul.e.taylor@me.com>

Closes #2742 from trxcllnt/js-fix-writing-sliced-arrays and squashes the following commits:

3ab9da080 <ptaylor> fix serializing sliced Utf8 vectors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants