-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix: take_bytes should not reuse indices null buffer #6616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| let values = r.as_string::<i32>().iter().collect::<Vec<_>>(); | ||
| assert_eq!(&values, &[None, Some("foo"), Some("foo"), Some("foo")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this fix, the result will become None, None, None, None.
tustvold
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comet shouldn't be modifying a buffer that is not exclusively owned by it, this is UB and breaks all sorts of invariants
| let null_slice = binding.data_ptr(); | ||
| for i in 0..4 { | ||
| unsafe { | ||
| bit_util::unset_bit_raw(null_slice.as_ptr(), i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is UB, you're mutating a pointer you don't have exclusive access to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just simulates the case that the null buffer is modified after take is called.
Even don't consider Comet, I think for take semantics (it is not "clone"), it should not just slice input arrays (data or null).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just simulates the case that the null buffer is modified after take is called.
Which is UB, you are mutating a buffer without exclusive access to it. The optimisation of reusing the buffer is perfectly sound because of the aliasing invariants of Rust, not only do we rely on this extensively within both arrow and DataFusion, but the compiler itself relies on it
It is not about should or shouldn't. Comet reuses underlying buffers in the native reader. This is not exclusive to Comet. |
Rust's memory model disallows aliasing buffers in this way - see here. What you describe contravenes this model and is therefore UB. If you cannot guarantee that comet's native reader won't reuse the buffer behind Rust's back, you have to copy the memory into a fresh allocation. |
|
Good. Let's not talk about Comet. Let's simply focus on the kernel behavior. I think for take semantics (it is not "clone"), it should not just slice input arrays (data or null). |
Why? It is a perfectly sound optimisation, there is no reason to perform a deep copy here? It is UB for anyone to mutate the buffer behind our back, so there is no reason to not just use it? |
|
Why? Currently you have a kernel with inconsistent behavior. Isn't it enough? It makes confused for users and users might need to look into the implementation details to know what the inconsistency comes from. For example, you tries to modify the indices array after calling I don't think this is good design for a kernel behavior. |
The contract of a kernel should be on the semantic value of the output, this leaves kernels free to implement physical layout optimisations such as this. I don't believe we ever document or articulate any contract on how various kernels should behave w.r.t their inputs, this would not only be very restrictive but extremely fragile. Take is far from the only kernel that will simply clone input buffers, especially if one considers types like DictionaryArray where the underlying dictionary may or may not be recomputed depending on heuristics, or any of the ViewArray types. From the take kernel's perspective it has no way to know that you want to reuse the null buffer that it was given, so the correct thing is for it to not create a fresh allocation unnecessarily. If code then wants to try to reuse the buffer, it can try, falling back to performing the allocation if necessary. This is safe, sound and optimal. In general the approach of arrow-rs and DataFusion is to only allocate when absolutely necessary |
|
I don't agree but I respect. |
|
Perhaps @alamb might provide a second opinion. FWIW you need to fix the aliasing bug, regardless of thoughts on what the take kernel is doing, what comet is doing is UB |
|
The consistency of kernel behavior and design is higher than such minor optimization. For now, users need to dig into the details inside the kernel when calling, if they don't have any knowledge about this crate. I don't think this is a good design from a higher level point of view. |
Perhaps you could articulate what the policy is you expect to be? Should all kernels deep clone all inputs? Would this not have severe performance ramifications?
Yes relying on behaviour beyond the documented contract of a kernel is a recipe for needing to dig into the details of said function. https://xkcd.com/1172/ comes to mind here.
We may have to agree to disagree, but let's see what Andrew has to say |
For This is the
I'm not sure for others, but for me "take" semantics is to have something separately, not share it. The current behavior is counterintuitive (at least to me). |
|
FWIW my view: I agree with @tustvold many kernels try to reuse buffers / input arrays if possible to minimize allocations / copies and I would expect arrow-rs to do so if possible (for whatever kernel) whenever dealing with immutable buffers (behind |
I don't fully understand this assertion. Is the idea that since the docs say
That implies the output array should be entirely new (aka not share anything with the source arrays)? |
|
Instead of modifying the Like could we have comet call some function like /// Ensure that all underlying buffers are purely owned (not shared)
fn ensure_owned(array: ArrayRef) -> ArrayRef {
// downcast DictionaryArray and look for slice / shared references
...
}🤔 |
I agree that having to dig into the internals of the implementation is not a good design. Maybe we should better document that the arrow-rs kernels / implementations all assume immutable buffers and have various optimizations that may reuse portions of the input |
|
My biggest concern is not really about this particular PR, but more the general question of "what can be assumed about the output of a function given it got an Do we need to document / test the various kernels and operations that guarantee no part of the input array would be reused 🤔 |
We do have the beginnings of the docs here - https://docs.rs/arrow-array/latest/arrow_array/#low-level-api But we could definitely document better that the array abstractions, and by extension the kernels that operate on them, are built around reference counted immutable buffers, and implementations should not make any assumptions about sole ownership of buffers. If codepaths require sole ownership they must use the safe APIs for doing so, such as into_mutable.
Tbh I am a little confused why one would make any assumptions here, but it certainly couldn't hurt to document that we avoid allocating whenever possible. |
|
Let me clarify it. I am not against the behavior of sharing buffers between input/output on other kernels. But for |
|
For a DictionaryArray would you expect it to create a deep clone of the dictionary values? What about a ViewArray where the whole design is to avoid copying the values buffers? I think this discussion is getting a bit philosophical and not particularly productive, we fundamentally cannot change take to always perform a deep copy of the input without significantly regressing performance. It is a fundamental design of arrow-rs, and other arrow implementations, that buffers are immutable reference counted memory regions and kernels are free to take advantage of that. Perhaps you might be willing to suggest some documentation improvements that would clarify this to avoid future confusion? |
|
I will add some documentation to I am adding workaround in Comet codebase and will also have some custom unpack dictionary and take to prevent the issue. So it won't rely on arrow-rs kernel. |
What are you trying to say? When unpacking dictionary, how do you not copy dictionary values? If you are talking about |
|
Let me close this. I will open another PR for documenting the kernel. Thanks for your review. |
Btw, note that |
Nobody is disputing this, however, there are many cases where it reuses buffers and entire arrays. Changing this would significantly regress performance for those cases. Yes copying a null buffer as in this PR isn't a terrible regression, but the broader principle you're advocating for that take should never reuse its inputs would be terrible for performance |
|
Thank you @viirya @Dandandan and @tustvold -- I hope we end up in a place that is good for Comet and good for other users of arrow as well |
I don't really find "many" cases note: I don't know how you can take indexed values from buffers while reusing the buffers at the same time, except for special ones like dictionary value (which I already mentioned it is reasonable). |
|
From a quick scan
Not sure if that counts as "many", but it is very important for those cases that we don't clone. Taking a step back though, it is kind of irrelevant whether the kernel reuses or doesn't reuse buffers, arrow is explicitly designed to encourage such reuse and we shouldn't be seeking to document or make commitments either way for any kernel. Not only does doing so limit our optimisations, it would be extremely fragile and hard to test, validate and maintain. That being said expanding the docs to further clarify this point is likely a valuable addition. Edit: as a general rule invariants should either be encoded in the type system or checked at runtime |
|
Yea, these are the only cases and I believe that are not "many". Not to mention that they are reasonable cases as these arrays are designed to be shared/reused. I already mentioned this in previous comment. That's why I don't understand that you claim that performance regression will be a problem for such a change. Except for above cases, the only cases reusing is happening on null buffer in With a negligible cost, making Anyway, as the workaround was already added to bypass the issue. And I don't think we can get a consensus. I respect your decision here. |
You are regressing performance, albeit marginally, to achieve something that isn't part of any kernel's public contract, nor that would be maintained or tested going forwards. So it is hard for me to see this as anything but a net loss, it doesn't add anything and makes performance worse. |
function name is not public contract? You design a function called "not_copy" but it copies? |
These statements are contradictory. I'm not going to engage further on this ticket |
Which issue does this PR close?
Closes #6617.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?