ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value #7635

c-jamie · 2020-07-04T19:05:30Z

This PR implements fill null for most primitive types.

Please note this is my first time contributing to such a project. I've attempted to follow all available guidelines.

There may also be obvious implementation details missing, as I am relatively new to C++ and its intricacies.

github-actions · 2020-07-04T19:16:44Z

https://issues.apache.org/jira/browse/ARROW-1587

wesm

Hello and welcome to the project! Thanks for starting to work on this. I have some high-level and low-level comments and can help guide you through this work, since this part of the project is pretty new.

wesm · 2020-07-04T20:10:28Z

cpp/src/arrow/compute/api_scalar.cc

None of these input validation checks should be here. Instead, the kernel should be implemented as an Arity::Binary() kernel with input validation handled by the kernel dispatch / executor layer. It's fine if the initial version has the type signature Array/Scalar instead of Any/Any

This has been reworked

wesm · 2020-07-04T20:11:05Z

cpp/src/arrow/compute/api_scalar.h

See comments above. I think this should be implemented as a binary function since the "fill values" could be provided by an array. This also will allow the execution layer to (in the near future) insert implicit casts where needed

wesm · 2020-07-04T20:11:21Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

Remove this line

wesm · 2020-07-04T20:14:12Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

Per above, I think this kernel would be better implemented without a KernelInit function

wesm · 2020-07-04T20:17:03Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

The output is never null (unless the fill value is null), right? So you don't need to touch the validity bitmap. In fact, the default mode of the kernel should not allocate one (also for reasons of zero copy, I will comment below)

I have reworked this.

wesm · 2020-07-04T20:22:14Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

I disagree:

If the fill value is not null, then the output is non-null, so there is no need to allocate a validity bitmap in most cases. So the default behavior should be to use NullHandling::COMPUTED_NO_PREALLOCATE for the nulls

Since the kernel can do zero-copy when the input has no nulls, this needs to use MemAllocation::NO_PREALLOCATE and instead leave memory allocation to the kernel

Thanks for taking time to review - please bear with me while I work through all these.

First point makes sense. For the second, do you mind expanding on that for my understanding? Zero copy in relation to what?

You see the places where you're doing

*out_arr = data;

I presume your intent is to pass along the memory from the input argument (when the input has no nulls)? That's what's meant by zero-copy, no need to do any processing
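To make the zero-copy point concrete, here is a minimal sketch using toy types. `ToyArray`, `ToyScalar`, and `FillNull` are hypothetical stand-ins written for illustration, not Arrow classes or the code in this PR. It assumes that a null fill value leaves the input unchanged, and it shows the two properties discussed above: the input buffers are shared when nothing needs replacing, and the output never gets a validity buffer when nulls were filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for an array: a shared values buffer plus an optional
// validity vector. A null validity pointer means "no nulls".
struct ToyArray {
  std::shared_ptr<const std::vector<int64_t>> values;
  std::shared_ptr<const std::vector<bool>> validity;

  int64_t null_count() const {
    if (!validity) return 0;
    int64_t n = 0;
    for (bool v : *validity) n += v ? 0 : 1;
    return n;
  }
};

// Hypothetical stand-in for a scalar that may itself be null.
struct ToyScalar {
  int64_t value;
  bool is_valid;
};

// Replace nulls in `in` with `fill`. If `fill` is null or `in` has no nulls,
// the input is returned as-is and its buffers are shared, not copied.
ToyArray FillNull(const ToyArray& in, const ToyScalar& fill) {
  assert(!in.validity || in.validity->size() == in.values->size());
  if (!fill.is_valid || in.null_count() == 0) {
    return in;  // zero-copy: the same shared_ptr buffers are passed along
  }
  // One allocation for the output values; no validity buffer is allocated
  // because every slot is valid after filling.
  auto out = std::make_shared<std::vector<int64_t>>(*in.values);
  for (std::size_t i = 0; i < out->size(); ++i) {
    if (!(*in.validity)[i]) (*out)[i] = fill.value;
  }
  return ToyArray{out, nullptr};
}
```

In this sketch, calling `FillNull` on an array without nulls returns a `ToyArray` whose `values` pointer compares equal to the input's, which is the property the `*out_arr = data;` assignment is relying on.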

Reworked, following your guidance.

wesm · 2020-07-04T20:23:50Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

We should rework this to use a combination of BitBlockCounter and probably GenerateBitsUnrolled
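As a rough illustration of the block-counting idea (not the Arrow `BitBlockCounter` API itself), the sketch below walks a validity bitmap 64 values at a time and uses a popcount to skip all-valid blocks and bulk-fill all-null blocks. `FillNullBlockwise` and the word-packed bitmap layout (bit i of word w describes value 64*w + i, 1 = valid) are assumptions made for this example.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: overwrite null slots in `values` with `fill_value`,
// inspecting the validity bitmap one 64-bit word at a time.
void FillNullBlockwise(const std::vector<uint64_t>& validity_words,
                       int64_t fill_value, std::vector<int64_t>* values) {
  assert(values != nullptr);
  assert(validity_words.size() * 64 >= values->size());
  const std::size_t n = values->size();
  for (std::size_t base = 0; base < n; base += 64) {
    const std::size_t len = (n - base < 64) ? n - base : 64;
    uint64_t word = validity_words[base / 64];
    // In the final partial block, ignore bits past the end of the array.
    if (len < 64) word &= (uint64_t{1} << len) - 1;
    const std::size_t valid = std::bitset<64>(word).count();
    if (valid == len) continue;  // all valid: nothing to replace
    if (valid == 0) {            // all null: fill the whole block at once
      std::fill(values->begin() + base, values->begin() + base + len, fill_value);
      continue;
    }
    for (std::size_t i = 0; i < len; ++i) {  // mixed block: test each bit
      if (((word >> i) & 1) == 0) (*values)[base + i] = fill_value;
    }
  }
}
```

The benefit comes from the common cases: arrays with long runs of valid (or null) values are handled per block rather than per bit.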

This has been reworked.

wesm · 2020-07-04T20:25:17Z

cpp/src/arrow/compute/kernels/scalar_fill_null_test.cc

These are declared elsewhere I think?

I see them declared a couple of times in the code base as PrimitiveDictionaries, e.g. in scalar_set_lookup_test.cc. Should I use that instead?

wesm · 2020-07-04T20:26:09Z

cpp/src/arrow/compute/kernels/scalar_fill_null_test.cc

Can we use the ArrayFromJSON functions instead for specifying the test cases?

wesm · 2020-07-04T20:26:40Z

cpp/src/arrow/compute/kernels/scalar_fill_null_test.cc

The behavior of the kernel when passing a scalar with is_valid=false is not validated (and probably yields incorrect results)

I have included a test case

github-actions · 2020-07-10T10:31:52Z

https://issues.apache.org/jira/browse/ARROW-1567

wesm · 2020-07-11T20:46:29Z

This is close to what I'm looking for. I'm going to push some changes to this branch in a little while and then will merge this

wesm · 2020-07-11T21:28:33Z

I changed the implementations to do a single memory allocation and avoid the builder classes, which will be faster, and fixed some other stuff. This also instantiates fewer templates, since we can use e.g. a UInt64 template to process all 64-bit types, including Double.
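The "one kernel per bit width" idea can be sketched as follows. This is an illustrative toy, not the code pushed to the branch: `FillNullBits` and `FillNull64` are hypothetical names. The loop that does the work is instantiated once for `uint64_t` and treats every 64-bit value as an opaque bit pattern, so `double` and `int64_t` share the same generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The actual work loop, instantiated once per bit width. Values are handled
// as opaque bit patterns stored in a raw byte buffer.
template <typename UInt>
void FillNullBits(const std::vector<bool>& is_valid, UInt fill_bits, uint8_t* data) {
  for (std::size_t i = 0; i < is_valid.size(); ++i) {
    if (!is_valid[i]) {
      std::memcpy(data + i * sizeof(UInt), &fill_bits, sizeof(UInt));
    }
  }
}

// Thin typed wrapper for any trivially copyable 64-bit type. Only the bit
// conversion depends on T; the loop above is shared across all such types.
template <typename T>
void FillNull64(const std::vector<bool>& is_valid, T fill, std::vector<T>* values) {
  static_assert(sizeof(T) == sizeof(uint64_t), "64-bit types only");
  assert(values != nullptr && is_valid.size() == values->size());
  uint64_t bits;
  std::memcpy(&bits, &fill, sizeof(bits));
  FillNullBits<uint64_t>(is_valid, bits,
                         reinterpret_cast<uint8_t*>(values->data()));
}
```

The same pattern would extend to 8-, 16-, and 32-bit widths with one `FillNullBits` instantiation each.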

wesm · 2020-07-11T21:29:34Z

cpp/src/arrow/compute/kernels/scalar_fill_null.cc

-    }
-    if (!fill_value.is_scalar()) {
-      ctx->SetStatus(Status::Invalid("fill value must be a scalar"));
-    }


Note: these type checks are not necessary. You can safely assume that once Exec is called that the types have already been checked
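A toy model of why the checks are redundant, under the assumption that dispatch matches argument shapes and types against a kernel signature before calling it. `ToyDatum`, `ToyKernel`, `ToyFunction`, and `MakeToyFillNull` are all hypothetical names for this sketch and do not mirror the real Arrow classes. The point is only that the exec body can index its arguments without re-validating them, because a mismatched call never reaches it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum class Shape { ARRAY, SCALAR };

// Hypothetical argument descriptor: shape plus a type name.
struct ToyDatum {
  Shape shape;
  std::string type;
};

// A kernel declares the signature it accepts; exec assumes it was honored.
struct ToyKernel {
  std::vector<ToyDatum> signature;
  std::function<std::string(const std::vector<ToyDatum>&)> exec;
};

struct ToyFunction {
  std::vector<ToyKernel> kernels;

  static bool Matches(const std::vector<ToyDatum>& sig,
                      const std::vector<ToyDatum>& args) {
    if (sig.size() != args.size()) return false;
    for (std::size_t i = 0; i < sig.size(); ++i) {
      if (sig[i].shape != args[i].shape || sig[i].type != args[i].type) return false;
    }
    return true;
  }

  // Returns an empty string on success (result in *out), else an error message.
  std::string Call(const std::vector<ToyDatum>& args, std::string* out) const {
    assert(out != nullptr);
    for (const ToyKernel& kernel : kernels) {
      if (Matches(kernel.signature, args)) {
        *out = kernel.exec(args);  // types were already checked above
        return "";
      }
    }
    return "no kernel matching input types";
  }
};

ToyFunction MakeToyFillNull() {
  ToyKernel kernel;
  kernel.signature.push_back(ToyDatum{Shape::ARRAY, "int64"});
  kernel.signature.push_back(ToyDatum{Shape::SCALAR, "int64"});
  // No "fill value must be a scalar" check here: dispatch guarantees it.
  kernel.exec = [](const std::vector<ToyDatum>& args) {
    return "fill_null(" + args[0].type + ")";
  };
  ToyFunction fn;
  fn.kernels.push_back(kernel);
  return fn;
}
```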

wesm · 2020-07-11T21:30:54Z

+1, will merge once the build passes

wesm · 2020-07-11T22:04:09Z

@c-jamie thanks for the patch, could you let me know your ASF JIRA username (or create one if you don't have one) so I can assign the issue to you?

c-jamie · 2020-07-12T09:51:51Z

I've commented on https://issues.apache.org/jira/browse/ARROW-1567

Thanks for all the help!

wesm reviewed Jul 4, 2020

c-jamie force-pushed the ARROW-1587-implement-fill-null branch from 8a0ab4a to 168a5b9 Compare July 7, 2020 16:21

c-jamie changed the title ~~ARROW-1587: [C++] implement fill null~~ ARROW-1567: [C++] implement fill null Jul 10, 2020

c-jamie added 2 commits July 11, 2020 15:45

ARROW-1587: [C++] implement fill null

0079ecc

address review comments

def327d

Do not use builders in implementation, only generate 1 kernel per bit width, other fixes

bc45320

wesm force-pushed the ARROW-1587-implement-fill-null branch from 168a5b9 to bc45320 Compare July 11, 2020 21:26

wesm reviewed Jul 11, 2020

wesm changed the title ~~ARROW-1567: [C++] implement fill null~~ ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value Jul 11, 2020

wesm closed this in 16290e7 Jul 11, 2020

c-jamie deleted the ARROW-1587-implement-fill-null branch September 27, 2020 17:40

asfimport mentioned this pull request Jul 12, 2020

[C++] Implement "fill null" kernels that replace null values with some scalar replacement value #17581

Closed
