-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1888: [C++] Implement Struct Casts #12248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| if (in_field_count != out_field_count) { | ||
| ARROW_RETURN_NOT_OK( | ||
| Status(StatusCode::TypeError, "struct field sizes do not match")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could imagine the cast dropping columns or adding columns of nulls in this case, too (ARROW-7051 would make that more efficient, and I believe this is needed to fully complete ARROW-14658)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can just return Status::TypeError(...) - no need to use the macro or construct the status manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you think 7051 is a pre-cursor to this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave it for afterwards/we can file another JIRA for this case.
|
Thanks for the awesome feedback - I'll look to tackle this over the next week or so |
|
Alright passing now - a couple failures but I think unrelated |
| std::vector<std::shared_ptr<Field>> fields4 = { | ||
| std::make_shared<Field>("a", int8(), false), | ||
| std::make_shared<Field>("b", int8(), false)}; | ||
| std::shared_ptr<Array> a4, b4; | ||
| a4 = ArrayFromJSON(int8(), "[1, 2]"); | ||
| b4 = ArrayFromJSON(int8(), "[3, 4]"); | ||
| ASSERT_OK_AND_ASSIGN(auto dest2, StructArray::Make({a4, b4}, fields4)); | ||
| auto options = CastOptions::Safe(dest2->type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, instead of constructing dest2, why not just use arrow::struct_() to directly construct the type?
|
|
||
| const ArrayData& in_array = *batch[0].array(); | ||
| ArrayData* out_array = out->mutable_array(); | ||
| out_array->buffers = in_array.buffers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I look at this, I think there's still a missing case here. We should copy the offset over as well (and then we should not slice the child arrays below), or if there's a top-level validity buffer, we should slice the buffer before copying. I don't think the tests notice this right now because there's no top-level validity buffer in any of the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StructArray::Make does take a null bitmap so we should be able to test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First part of this is done. For the sake of comprehensiveness I tried introducing null values into the main test, but didn't make a difference at all. Is that different from explicitly providing a bitmap as you've described?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's different, yes, since StructArray is nested: for a given row, either the subarrays can be null, or the top-level StructArray itself may be null (and then the child array's nullity are ignored).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great explanation - think I've added this to the test now but didn't change the pass/fail status. Still missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good now, since we're not slicing the child and copying over the out offset, so it should pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though - sorry - but it's probably better to slice the children, not copy the offset, and slice the bitmap in in_array.buffers when copying? The reason being, if we Cast a large array that's been sliced, this right now will cast the entire child array, even though we only technically care about the sliced part. Does that make sense?
The null bitmap can be copied with CopyBitmap:
arrow/cpp/src/arrow/util/bitmap_ops.h
Lines 44 to 46 in ec38aeb
| ARROW_EXPORT | |
| Result<std::shared_ptr<Buffer>> CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, | |
| int64_t offset, int64_t length); |
It would be something like
if (in_array.buffers[0]) {
ARROW_ASSIGN_OR_RAISE(out_array->buffers[0], CopyBitmap(ctx->memory_pool(), in_array.buffers[0]->data(), in_array.offset, in_array.length));
}
Though, frankly, it should work below to set kernel.null_handling to INTERSECTION and then the compute infra will compute the null bitmap for you. (And you shouldn't have to mess with out_array->buffers or out_array->offset.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, there's lots of knobs to turn for this, so there's not one obvious way to implement things. I would see if kernel.null_handling + slicing children before casting works (it should, as far as I can tell without building and trying it myself.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries - this has been a great learning experience for me. Your code suggestion looks pretty similar to what's already being done with the List casts - I just didn't understand enough before why to include but makes sense here. Let me see how this all looks
| static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { | ||
| const CastOptions& options = CastState::Get(ctx); | ||
| const auto in_field_count = | ||
| checked_cast<const StructType&>(*batch[0].type()).num_fields(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but we could factor out the checked_cast of the types here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be my lack of C++ knowledge but when I tried that I was running into an issue where StructType was being immediately deleted given its lack of move or copy operators. Can post the trace back later if helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was tried? Something like this should work, off the top of my head:
const StructType& in_type = checked_cast<const StructType&>(*batch[0].type());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea pretty sure that was what I tried, but let me do that again tonight and post the traceback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool this is working. My mistake was trying to do const auto in_type = checked_cast<const StructType&>(*batch[0].type()); which yielded
/arrow/cpp/src/arrow/type.h:972:20: error: use of deleted function ‘std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) [with _Tp = arrow::StructType::Impl; _Dp = std::default_delete<arrow::StructType::Impl>]’
Not using auto here works well
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for working through this!
|
Benchmark runs are scheduled for baseline = 90a6f11 and contender = 3b9462a. 3b9462a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
Just adding it here as a reference: there is https://issues.apache.org/jira/browse/ARROW-15643 which is talking about some potential ways to relax this struct->struct cast (eg allow a subset of the fields) |
Implements casts from one struct type to another, given the same field names and number of fields