-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12861: [C++][Compute] Add sign function kernels #10395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12861: [C++][Compute] Add sign function kernels #10395
Conversation
f9798bf to
7e3bb1b
Compare
|
The |
|
The test for Indeed this was the case. Not sure of a current solution for this. |
|
I'd say that even if this could be called an arithmetic function it's not ideal to require ArithmeticOptions to carry fields which only apply to sign. Please write SignOptions instead |
|
Can you explain in which context this function is useful? Furthermore, can you explain what the purpose is of having two different variants? |
|
While working on the rounding functions, I make use of the signedness of the input for specific rounding modes (e.g., round-half-away-from-zero). After making |
|
@nealrichardson What do you think? Is this useful? |
|
I wouldn't say it's the most useful kernel to be adding, if I were prioritizing kernels to write, but it is something that people do. In R for example there is a |
|
It's a thing in NumPy too https://numpy.org/doc/stable/reference/generated/numpy.sign.html |
|
Ok, all three versions of the |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add a note explaining how the output is computed. You can look at other sections for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a "Notes" column to the table, and included notes describing the output values and special case of 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests which tweak the validity bits by hand are only meaningful if the kernel may return an error, which is not the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. These tests were removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove commented out code.
447c88b to
236741c
Compare
|
The |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * \(2) Similar to ``sign`` but zero input is considered as a signed number. | |
| * \(2) Similar to ``sign`` but zero is considered signed for floating point inputs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sign_with_signed_zero variant treats 0 as a positive value for integral types, but either -0 or +0 for floating-point types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I agree with this change. The sign_with_signed_zero variant only applies to floating-point numbers resulting in either (-1,1). For integral values, the result will be either of (-1,0,1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessarily general, especially given we have only "sign". Additionally, note that CheckDispatchBest does not assert the output type: it asserts the type to which arguments must be implicitly cast:
| // Functions with fixed output type | |
| struct FuncAndOutType { | |
| std::string name; | |
| std::shared_ptr<DataType> out_type; | |
| }; | |
| FuncAndOutType funcs[] = {{"sign", int8()}}; | |
| for (const auto& func : funcs) { | |
| for (const auto& ty : {int8(), int16(), int32(), int64(), uint8(), uint16(), uint32(), | |
| uint64(), float32(), float64()}) { | |
| CheckDispatchBest(func.name, {ty}, {func.out_type}); | |
| CheckDispatchBest(func.name, {dictionary(int8(), ty)}, {func.out_type}); | |
| } | |
| } | |
| // Sign always outputs to int8 | |
| for (std::string name : {"sign", "sign_with_negative_zero"}) { | |
| for (const auto& ty : {int8(), int16(), int32(), int64(), uint8(), uint16(), uint32(), | |
| uint64(), float32(), float64()}) { | |
| CheckDispatchBest(name, {ty}, {ty}); | |
| CheckDispatchBest(name, {dictionary(int8(), ty)}, {ty}); | |
| } | |
| } |
In the example above, CheckDispatchBest(name, {ty}, {int8()}); would imply that we must first cast ty -> int8() inputs before invoking the kernel, which is incorrect. Compare this with CheckDispatchBest(name, {dictionary(int8(), ty)}, {ty}); which asserts that we can't directly extract the sign bit from dictionary encoded data; encoded arrays must first be decoded for the kernel to operate on them.
dcfcf61 to
14f3eee
Compare
39b1b0e to
fa4334f
Compare
cpp/src/arrow/compute/api_scalar.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why we need both variants. Is this a user request? @ianmcook ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we could expose a copysign instead of sign_with_signed_zero, which would be more generic.
(sign_with_signed_zero(x) would simply be copysign(x, 1), AFAICT)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou The sign kernel is to be used by the rounding function. I agree with the fact that sign_with_signed_zero is equivalent to copysign. Actually, copysign is the variant needed for rounding function.
Nevertheless, sign is a common function exposed in frontend tools, so I consider it worthwhile to provide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I consider that copysign should not be exposed as a compute function but rather used internally, and instead we should expose sign and sign_with_signed_zero (implementations may use copysign(x, 1). The rationale for this is that copysign(x, 1) is not common in any SQL, dataframes library but sign is.
A use case of sign is to build a custom ranking based on the signedness of values.
For example, consider ranking values relative to a constant value (e.g., 3):
SELECT CASE
WHEN @a < 3 THEN 0
WHEN @a = 3 THEN 1
WHEN @a > 3 THEN 2
END
is not supported because expressions are not allowed in CASE-WHEN constructs, so a workaround is
SELECT CASE SIGN(@a - 3)
WHEN -1 THEN 0
WHEN 0 THEN 1
WHEN 1 THEN 2
END
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, can we answer the original question? Why are two variants needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After careful thought, there is not a general need for having two variants. The only case where I see that sign_with_signed_zero can be relevant is if data stored is used for calculus-related operations (e.g., integral limits).
I will remove the sign_with_signed_zero variant.
8a1acf7 to
1e29103
Compare
2d8bab5 to
ea19dc6
Compare
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
This PR adds the sign function to the compute layer as a unary scalar function.