Skip to content

Conversation

@WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Jan 25, 2022

Implements casts from one struct type to another, given the same field names and number of fields

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 161 to 164
if (in_field_count != out_field_count) {
ARROW_RETURN_NOT_OK(
Status(StatusCode::TypeError, "struct field sizes do not match"));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could imagine the cast dropping columns or adding columns of nulls in this case, too (ARROW-7051 would make that more efficient, and I believe this is needed to fully complete ARROW-14658)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you can just return Status::TypeError(...) - no need to use the macro or construct the status manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you think 7051 is a pre-cursor to this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave it for afterwards/we can file another JIRA for this case.

@WillAyd
Copy link
Contributor Author

WillAyd commented Jan 26, 2022

Thanks for the awesome feedback - I'll look to tackle this over the next week or so

@WillAyd
Copy link
Contributor Author

WillAyd commented Feb 9, 2022

Alright passing now - a couple failures but I think unrelated

Comment on lines 2315 to 2322
std::vector<std::shared_ptr<Field>> fields4 = {
std::make_shared<Field>("a", int8(), false),
std::make_shared<Field>("b", int8(), false)};
std::shared_ptr<Array> a4, b4;
a4 = ArrayFromJSON(int8(), "[1, 2]");
b4 = ArrayFromJSON(int8(), "[3, 4]");
ASSERT_OK_AND_ASSIGN(auto dest2, StructArray::Make({a4, b4}, fields4));
auto options = CastOptions::Safe(dest2->type());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, instead of constructing dest2, why not just use arrow::struct_() to directly construct the type?


const ArrayData& in_array = *batch[0].array();
ArrayData* out_array = out->mutable_array();
out_array->buffers = in_array.buffers;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look at this, I think there's still a missing case here. We should copy the offset over as well (and then we should not slice the child arrays below), or if there's a top-level validity buffer, we should slice the buffer before copying. I don't think the tests notice this right now because there's no top-level validity buffer in any of the test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructArray::Make does take a null bitmap so we should be able to test this.

Copy link
Contributor Author

@WillAyd WillAyd Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First part of this is done. For the sake of comprehensiveness I tried introducing null values into the main test, but didn't make a difference at all. Is that different from explicitly providing a bitmap as you've described?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's different, yes, since StructArray is nested: for a given row, either the subarrays can be null, or the top-level StructArray itself may be null (and then the child array's nullity are ignored).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great explanation - think I've added this to the test now but didn't change the pass/fail status. Still missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now, since we're not slicing the child and copying over the out offset, so it should pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though - sorry - but it's probably better to slice the children, not copy the offset, and slice the bitmap in in_array.buffers when copying? The reason being, if we Cast a large array that's been sliced, this right now will cast the entire child array, even though we only technically care about the sliced part. Does that make sense?

The null bitmap can be copied with CopyBitmap:

ARROW_EXPORT
Result<std::shared_ptr<Buffer>> CopyBitmap(MemoryPool* pool, const uint8_t* bitmap,
int64_t offset, int64_t length);

It would be something like

if (in_array.buffers[0]) {
  ARROW_ASSIGN_OR_RAISE(out_array->buffers[0], CopyBitmap(ctx->memory_pool(), in_array.buffers[0]->data(), in_array.offset, in_array.length));
}

Though, frankly, it should work below to set kernel.null_handling to INTERSECTION and then the compute infra will compute the null bitmap for you. (And you shouldn't have to mess with out_array->buffers or out_array->offset.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, there's lots of knobs to turn for this, so there's not one obvious way to implement things. I would see if kernel.null_handling + slicing children before casting works (it should, as far as I can tell without building and trying it myself.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries - this has been a great learning experience for me. Your code suggestion looks pretty similar to what's already being done with the List casts - I just didn't understand enough before why to include but makes sense here. Let me see how this all looks

static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
const CastOptions& options = CastState::Get(ctx);
const auto in_field_count =
checked_cast<const StructType&>(*batch[0].type()).num_fields();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but we could factor out the checked_cast of the types here and below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be my lack of C++ knowledge but when I tried that I was running into an issue where StructType was being immediately deleted given its lack of move or copy operators. Can post the trace back later if helpful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was tried? Something like this should work, off the top of my head:

const StructType& in_type = checked_cast<const StructType&>(*batch[0].type());

Copy link
Contributor Author

@WillAyd WillAyd Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea pretty sure that was what I tried, but let me do that again tonight and post the traceback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool this is working. My mistake was trying to do const auto in_type = checked_cast<const StructType&>(*batch[0].type()); which yielded

/arrow/cpp/src/arrow/type.h:972:20: error: use of deleted function ‘std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) [with _Tp = arrow::StructType::Impl; _Dp = std::default_delete<arrow::StructType::Impl>]’

Not using auto here works well

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for working through this!

@lidavidm lidavidm closed this in 3b9462a Feb 11, 2022
@ursabot
Copy link

ursabot commented Feb 11, 2022

Benchmark runs are scheduled for baseline = 90a6f11 and contender = 3b9462a. 3b9462a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.3% ⬆️0.0%] test-mac-arm
[Failed] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorisvandenbossche
Copy link
Member

Just adding it here as a reference: there is https://issues.apache.org/jira/browse/ARROW-15643 which is talking about some potential ways to relax this struct->struct cast (eg allow a subset of the fields)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants