
Conversation

@lidavidm (Member):

This is a bit messy, but implements a variadic scalar maximum/minimum kernel.

Member Author:

Do we want to name this something else?

Member:

I think it is worth giving this a more general name, because there are other potential uses. Maybe ElementWiseAggregateOptions? (since this is aggregation, just across instead of down)

Member:

If using "Aggregate" is too confusing, maybe "Combine" or "Merge" would work too, e.g. ElementWiseCombineOptions.

Member Author:

Since we have ScalarAggregateOptions I think ElementWiseAggregateOptions should be ok. I've updated the PR.
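For reference, a rough sketch of what the renamed options class might look like, assuming a single skip_nulls field analogous to ScalarAggregateOptions (this is an illustration, not necessarily the PR's exact code):

```cpp
// Hypothetical sketch; in the real library this would derive from
// arrow::compute::FunctionOptions. If skip_nulls is true, null inputs are
// ignored when folding values across the arguments; if false, any null makes
// the corresponding output slot null.
struct ElementWiseAggregateOptions {
  explicit ElementWiseAggregateOptions(bool skip_nulls = true)
      : skip_nulls(skip_nulls) {}
  bool skip_nulls;
};
```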

Member Author:

This is rather a lot of code being generated; it might be nice if I could figure out the right template setup to consolidate the temporal types with their respective equivalent integral type.

Member:

There's GeneratePhysicalInteger, though that doesn't include floating point. Perhaps you could add GeneratePhysicalNumeric?
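For illustration, a simplified standalone sketch of that dispatch pattern, with a plain enum and function pointers standing in for Arrow's codegen helpers (not the actual GeneratePhysicalNumeric signature):

```cpp
#include <cstdint>
#include <stdexcept>

enum class TypeId { Int32, Date32, Int64, Timestamp, Duration, Float, Double };

// One kernel instantiation per *physical* C type.
template <typename CType>
void MinMaxExec() { /* element-wise min/max over CType values */ }

using KernelExec = void (*)();

// Temporal types reuse the kernel generated for their storage integer, so the
// template is instantiated once per physical representation rather than once
// per logical type.
KernelExec DispatchPhysicalNumeric(TypeId id) {
  switch (id) {
    case TypeId::Int32:
    case TypeId::Date32:
      return MinMaxExec<int32_t>;
    case TypeId::Int64:
    case TypeId::Timestamp:
    case TypeId::Duration:
      return MinMaxExec<int64_t>;
    case TypeId::Float:
      return MinMaxExec<float>;
    case TypeId::Double:
      return MinMaxExec<double>;
    default:
      throw std::runtime_error("no kernel for this type");
  }
}
```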

Member Author:

I added this. Note that if we want to have CommonTimestamp(), I don't think this'll fly, since we'd need to scale the timestamp values to the common unit before comparison.

Member:

No, CommonTimestamp indicates that an implicit cast is necessary (which in the case of timestamps includes the scaling). It's not the responsibility of the kernel to execute that implicit cast.
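A hedged usage sketch of what that implies (the function name here is the one discussed later in this thread; the final registered name may differ): the dispatcher promotes both timestamps to the finer common unit before the kernel runs, so the kernel only ever compares values in a single unit.

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<arrow::Datum> TimestampExample() {
  using arrow::TimeUnit;
  // 2 s vs. 1500 ms: after the implicit cast to milliseconds, the minimum is
  // 1500 ms; without the cast the kernel would wrongly compare 2 with 1500.
  auto a = std::make_shared<arrow::TimestampScalar>(
      2, arrow::timestamp(TimeUnit::SECOND));
  auto b = std::make_shared<arrow::TimestampScalar>(
      1500, arrow::timestamp(TimeUnit::MILLI));
  return arrow::compute::CallFunction("element_wise_min",
                                      {arrow::Datum(a), arrow::Datum(b)});
}
```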

Member Author:

Ah! Ok, my mistake. I'll add some test cases for that.

Binary types aren't implemented here, though (and would need some work/restructuring to handle).

@lidavidm force-pushed the arrow-12751 branch 2 times, most recently from 86b55b8 to 7bb91e0 (May 25, 2021 15:58)
@lidavidm (Member Author):

(N.B. I still need to fix CommonTimestamp, got caught up in fixing the other feedback)

@pitrou (Member) left a comment:

Can you explain what the point is of a variadic function when we don't have e.g. variadic addition?

Member:

Use arr->null_count() == 0?

Member:

This will allocate a null bitmap even if all input arrays have 0 nulls.

Member:

Instead of if (first), you could move the bitmap allocation below and use if (!output->buffers[0]).

Member:

Shouldn't we ensure it's equal to 1?

Member:

Should add a column for the options class.

Member:

Why isn't this done by default in the test setup?

Member:

Uh. I don't think generating string representations for each and every test is a good idea.

@ianmcook (Member) commented May 26, 2021:

> Can you explain what the point is of a variadic function when we don't have e.g. variadic addition?

My understanding is that we are aiming for Arrow's compute API to reach parity with the collections of built-in functions available in popular SQL engines. These SQL engines include several variadic functions that combine/merge/aggregate values row-wise. Two such SQL functions are greatest() and least(). The two kernels implemented in this PR are equivalent to those two SQL functions. Another such SQL function is concat() and its variant concat_ws(), which we intend to implement in ARROW-12709.

There are two ways we could implement this functionality without implementing variadic functions:

  1. Chain multiple calls to a binary function (as is required for addition of more than two values, as you mention)
  2. Combine the arrays row-wise into a ListArray (ARROW-12739) then operate on the ListArray with a unary function

If one of those alternative approaches is superior, then perhaps we do not need a variadic function. But my current understanding is that both of these alternative approaches are inferior for reasons of usability and efficiency respectively.
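To make the comparison concrete, a hedged sketch of approach 1 versus a single variadic call (function name as used later in this PR; assume three equal-length numeric arrays):

```cpp
#include <arrow/compute/api.h>
#include <arrow/result.h>

using arrow::Datum;
using arrow::compute::CallFunction;

// SQL: SELECT greatest(a, b, c) ...
// Variadic: one pass over the three inputs.
arrow::Result<Datum> Greatest(const Datum& a, const Datum& b, const Datum& c) {
  return CallFunction("element_wise_max", {a, b, c});
}

// Approach 1 above: chaining a binary call. Each step materializes an
// intermediate array, so k inputs of length n cost roughly (k - 1) extra
// allocations and scans.
arrow::Result<Datum> GreatestChained(const Datum& a, const Datum& b,
                                     const Datum& c) {
  ARROW_ASSIGN_OR_RAISE(Datum ab, CallFunction("element_wise_max", {a, b}));
  return CallFunction("element_wise_max", {ab, c});
}
```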

@pitrou (Member) commented May 26, 2021:

I see. Well, if efficiency is really important and it's common to have more than 2 inputs, then why not (though the implementation in this PR is probably far from optimal).

Member:

It just came to my mind that it's gonna be very confusing to have "minimum" and "maximum" functions in addition to "min_max". Perhaps we can be less ambiguous, e.g. "scalar_min", "scalar_max" (or "horizontal_min" or "row_min"...).

Member Author:

That's a good point. I'll rename the compute function. Perhaps elementwise_min in accordance with the options struct (though that is a bit wordy)?

Member:

That's wordy indeed. Or at least element_wise_min.

Member:

How about naming them least and greatest like they do in SQL?

Member:

How is that less ambiguous?

Member Author:

I just changed it to element_wise_min/max.

Member:

I agree that using least and greatest would not fully resolve the ambiguity (except to users familiar with SQL) but the use of these synonyms would at least signal that these are something different from min and max. I think element_wise_min and element_wise_max are fine and do a better job of disambiguating as long as we're OK with their length.

Member:

Instead of trying to recycle buffers, we could change the kernel's flag and allow the executor to preallocate the output data buffer. Then we populate it with the anti-extreme (for maximum over uint64_t, we zero the buffer). Then we wouldn't need a bitmap to express "no values folded yet", since get_extreme({anti_extreme, v...}) == get_extreme({v...}). We'd also be able to compute the null bitmap separately in either case: for !skip_nulls, AND the input bitmaps; for skip_nulls, OR them.

In the case where scalar_count != 0, we can ExecScalar and use the result instead of an anti_extreme.
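A simplified sketch of that idea, with plain vectors standing in for Arrow buffers and bitmaps (not the kernel's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct Column {
  std::vector<uint64_t> values;
  std::vector<bool> valid;
};

// Element-wise max across several equal-length columns. The output starts at
// the anti-extreme (0 for max over uint64_t), so folding a null slot is a
// no-op, and validity is combined separately according to skip_nulls.
Column ElementWiseMax(const std::vector<Column>& inputs, bool skip_nulls) {
  const size_t n = inputs.front().values.size();
  Column out;
  out.values.assign(n, std::numeric_limits<uint64_t>::min());
  // skip_nulls: a slot is valid if *any* input is valid (OR the bitmaps);
  // otherwise it is valid only if *all* inputs are valid (AND the bitmaps).
  out.valid.assign(n, skip_nulls ? false : true);
  for (const Column& in : inputs) {
    for (size_t i = 0; i < n; ++i) {
      if (in.valid[i]) out.values[i] = std::max(out.values[i], in.values[i]);
      out.valid[i] = skip_nulls ? (out.valid[i] || in.valid[i])
                                : (out.valid[i] && in.valid[i]);
    }
  }
  return out;
}
```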

Member Author:

This works quite well except when we have a single NaN array argument (element_wise_min([NaN])), in which case we want to emit NaN, not the antiextreme.

Member Author:

Ah, maybe let's just have the 'antiextreme' of a float be NaN.

Member:

The antiextremes of float are Inf and -Inf.

Member Author:

Inf would trip us up if we had something like element_wise_min([NaN, 0], [NaN, 1]), where we'd want to emit NaN and not Inf as the first element. Using Inf as the antiextreme means we'd emit Inf, since the kernel takes any non-NaN value over any NaN.

Member:

Huh, I was expecting fmin(NaN, x) = NaN, but in fact it's fmin(NaN, x) = x. Never mind.
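For the record, a quick standalone check of that fmin/fmax behavior:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

int main() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  // fmin/fmax treat NaN as "missing": if exactly one argument is NaN,
  // the other argument is returned.
  assert(std::fmin(nan, 1.0) == 1.0);
  assert(std::fmax(1.0, nan) == 1.0);
  // Only when both arguments are NaN is NaN returned.
  assert(std::isnan(std::fmin(nan, nan)));
  return 0;
}
```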

@lidavidm force-pushed the arrow-12751 branch 2 times, most recently from 50e86be to 33813ac (June 2, 2021 16:26)
@ianmcook (Member) commented Jun 3, 2021:

@lidavidm could you briefly describe what happens here if the arrays passed to these kernels have different but comparable data types such as one array of doubles and one array of integers?

@lidavidm (Member Author) commented Jun 3, 2021:

> @lidavidm could you briefly describe what happens here if the arrays passed to these kernels have different but comparable data types such as one array of doubles and one array of integers?

For numeric/temporal types, all arrays will be cast to a common compatible type, much the same as with arithmetic kernels (e.g. for doubles + integers, all values will be cast to doubles).
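A hedged usage sketch of the mixed-type case (function name as used in this PR at the time; the cast is performed by the dispatch machinery, not the kernel itself):

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Given an int32 array and a float64 array, both arguments are implicitly
// cast to float64 before the kernel executes, so the result is float64.
arrow::Result<arrow::Datum> MixedTypes(
    const std::shared_ptr<arrow::Array>& ints,
    const std::shared_ptr<arrow::Array>& doubles) {
  return arrow::compute::CallFunction(
      "element_wise_min", {arrow::Datum(ints), arrow::Datum(doubles)});
}
```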

@bkietz (Member) left a comment:

Thanks for doing this!
