Skip to content

Conversation

@c-jamie
Copy link

@c-jamie c-jamie commented Jul 4, 2020

This PR implements fill null for most primitive types.

Please note this is my first time contribiting to such a project. I've attempted to follow all available guidelines.

There also may be obvious implementation details missing, as I am relatively new to CPP and it's intricacies.

@github-actions
Copy link

github-actions bot commented Jul 4, 2020

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello and welcome to the project! Thanks for starting to work on this. I have some high level and low level comments and can help guide you through this work since this part of the project is pretty new

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these input validation checks should be here. Instead, the kernel should be implemented as an Arity::Binary() kernel with input validation handled by the kernel dispatch / executor layer. It's fine if the initial version has the type signature Array/Scalar instead of Any/Any

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been reworked

Copy link
Member

@wesm wesm Jul 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above. I think this should be implemented as a binary function since the "fill values" could be provided by an array. This also will allow the execution layer to (in the near future) insert implicit casts where needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per above, I think this kernel would be better implemented without a KernelInit function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output is never null (unless the fill value is null), right? So you don't need to touch the validity bitmap. In fact, the default mode of the kernel should not allocate one (also for reasons of zero copy, I will comment below)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reworked this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree:

  • If the fill value is not null, then the output is non-null, so there is no need to allocate a validity bitmap in most cases. So the default behavior should be to use NullHandling::COMPUTED_NO_PREALLOCATE for the nulls
  • Since the kernel can do zero-copy when the input has no nulls, this needs to use MemAllocation::NO_PREALLOCATE and instead leave memory allocation to the kernel

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking time to review - please bear with me while I work through all these.

First point makes sense. For the second, do you mind expanding on that for my understanding? Zero copy in relation to what?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You see the places where you're doing

    *out_arr = data;

I presume your intent is to pass along the memory from the input argument (when the input has no nulls)? That's what's meant by zero-copy, no need to do any processing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked, following your guidance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rework this to use a combination of BitBlockCounter and probably GenerateBitsUnrolled

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has been reworked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are declared elsewhere I think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see them declared a couple of times in the code base as PrimitiveDictionaries eg in scalar_set_lookup_test.cc - should I use that instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the ArrayFromJSON functions instead for specifying the test cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of the kernel when passing a scalar with is_valid=false is not validated (and probably yield incorrect results)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have included a test case

@c-jamie c-jamie force-pushed the ARROW-1587-implement-fill-null branch from 8a0ab4a to 168a5b9 Compare July 7, 2020 16:21
@c-jamie c-jamie changed the title ARROW-1587: [C++] implement fill null ARROW-1567: [C++] implement fill null Jul 10, 2020
@github-actions
Copy link

@wesm
Copy link
Member

wesm commented Jul 11, 2020

This is close to what I'm looking for. I'm going to push some changes to this branch in a little while and then will merge this

@wesm wesm force-pushed the ARROW-1587-implement-fill-null branch from 168a5b9 to bc45320 Compare July 11, 2020 21:26
@wesm
Copy link
Member

wesm commented Jul 11, 2020

I changed the implementations to do a single memory allocation and avoid the builder classes, which will be faster, and fixed some other stuff. Additionally, instantiating fewer templates since we can use e.g. a UInt64 template to process all 64-bit types including Double, etc.

}
if (!fill_value.is_scalar()) {
ctx->SetStatus(Status::Invalid("fill value must be a scalar"));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: these type checks are not necessary. You can safely assume that once Exec is called that the types have already been checked

@wesm wesm changed the title ARROW-1567: [C++] implement fill null ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value Jul 11, 2020
@wesm
Copy link
Member

wesm commented Jul 11, 2020

+1, will merge once the build passes

@wesm wesm closed this in 16290e7 Jul 11, 2020
@wesm
Copy link
Member

wesm commented Jul 11, 2020

@c-jamie thanks for the patch, could you let me know your ASF JIRA username (or create one if you don't have one) so I can assign the issue to you?

@c-jamie
Copy link
Author

c-jamie commented Jul 12, 2020

I've commented on https://issues.apache.org/jira/browse/ARROW-1567

Thanks for all the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants