Skip to content

Conversation

@edponce
Copy link
Contributor

@edponce edponce commented May 25, 2021

This PR adds the sign function to the compute layer as a unary scalar function.

  • Numeric inputs result in any of (-1,0,1)
  • +/-0 input returns 0
  • Infinity is treated as a signed number
  • NaN input returns NaN

@github-actions
Copy link

@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch from f9798bf to 7e3bb1b Compare May 27, 2021 09:09
@edponce edponce marked this pull request as ready for review May 27, 2021 09:11
@edponce
Copy link
Contributor Author

edponce commented May 27, 2021

The sign function is an arithmetic function, therefore, I added its option allow_signed_zero to the ArithmeticOptions class.
Is this valid? given that sign does not overflows so it will not make use of check_overflow and currently, other compute functions do not use the allow_signed_zero option.

@edponce
Copy link
Contributor Author

edponce commented May 27, 2021

The test for double type failed in AMD64-MinGW32 with a RapidJSON error message "Number too big to be stored in double". I am not the first to encounter this error from RapidJSON. I disabled the tests that use std::numeric_limits<double>::max() to check if this is the triggering case.

Indeed this was the case. Not sure of a current solution for this.

@bkietz
Copy link
Member

bkietz commented May 27, 2021

I'd say that even if this could be called an arithmetic function it's not ideal to require ArithmeticOptions to carry fields which only apply to sign. Please write SignOptions instead

@pitrou
Copy link
Member

pitrou commented May 27, 2021

Can you explain in which context this function is useful? Furthermore, can you explain what the purpose is of having two different variants?

@edponce
Copy link
Contributor Author

edponce commented May 27, 2021

While working on the rounding functions, I make use of the signedness of the input for specific rounding modes (e.g., round-half-away-from-zero). After making sign utility functions, I found that many database systems provide such a function (although I do not have a particular use case). The option for treating zero as a signed value was suggested in issue discussion.

@pitrou
Copy link
Member

pitrou commented May 27, 2021

@nealrichardson What do you think? Is this useful?

@edponce edponce changed the title ARROW-12861: [C++][Compute] Add sign function and tests ARROW-12861: [C++][Compute] Add sign function kernels May 27, 2021
@nealrichardson
Copy link
Member

I wouldn't say it's the most useful kernel to be adding, if I were prioritizing kernels to write, but it is something that people do. In R for example there is a sign() function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sign.html, and it's a thing in SQL too.

@ianmcook
Copy link
Member

ianmcook commented Jun 1, 2021

@pitrou
Copy link
Member

pitrou commented Jun 7, 2021

Ok, all three versions of the sign function (Python, R, SQL) return (0,-1,+1), so this function should probably do the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a note explaining how the output is computed. You can look at other sections for examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a "Notes" column to the table, and included notes describing the output values and special case of 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests which tweak the validity bits by hand are only meaningful if the kernel may return an error, which is not the case here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. These tests were removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented out code.

@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch from 447c88b to 236741c Compare June 11, 2021 08:17
@edponce
Copy link
Contributor Author

edponce commented Jun 12, 2021

The Sign kernels were created with a fixed output type (Int8) using a new kernel generator dispatcher but the tests for DispatchBest fail. The output value from DispatchBest is the same as its input and the input/output values for DispatchExact are always int8. Could someone help with this issue?

@pitrou
Copy link
Member

pitrou commented Jun 15, 2021

@bkietz may have the answer to your question, @edponce .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \(2) Similar to ``sign`` but zero input is considered as a signed number.
* \(2) Similar to ``sign`` but zero is considered signed for floating point inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sign_with_signed_zero variant treats 0 as a positive value for integral types, but either -0 or +0 for floating-point types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I agree with this change. The sign_with_signed_zero variant only applies to floating-point numbers resulting in either (-1,1). For integral values, the result will be either of (-1,0,1).

Comment on lines 985 to 1067
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessarily general, especially given we have only "sign". Additionally, note that CheckDispatchBest does not assert the output type: it asserts the type to which arguments must be implicitly cast:

Suggested change
// Functions with fixed output type
struct FuncAndOutType {
std::string name;
std::shared_ptr<DataType> out_type;
};
FuncAndOutType funcs[] = {{"sign", int8()}};
for (const auto& func : funcs) {
for (const auto& ty : {int8(), int16(), int32(), int64(), uint8(), uint16(), uint32(),
uint64(), float32(), float64()}) {
CheckDispatchBest(func.name, {ty}, {func.out_type});
CheckDispatchBest(func.name, {dictionary(int8(), ty)}, {func.out_type});
}
}
// Sign always outputs to int8
for (std::string name : {"sign", "sign_with_negative_zero"}) {
for (const auto& ty : {int8(), int16(), int32(), int64(), uint8(), uint16(), uint32(),
uint64(), float32(), float64()}) {
CheckDispatchBest(name, {ty}, {ty});
CheckDispatchBest(name, {dictionary(int8(), ty)}, {ty});
}
}

In the example above, CheckDispatchBest(name, {ty}, {int8()}); would imply that we must first cast ty -> int8() inputs before invoking the kernel, which is incorrect. Compare this with CheckDispatchBest(name, {dictionary(int8(), ty)}, {ty}); which asserts that we can't directly extract the sign bit from dictionary encoded data; encoded arrays must first be decoded for the kernel to operate on them.

@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch from dcfcf61 to 14f3eee Compare June 16, 2021 06:14
@edponce edponce requested review from bkietz and pitrou June 16, 2021 18:12
@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch 2 times, most recently from 39b1b0e to fa4334f Compare June 21, 2021 20:36
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why we need both variants. Is this a user request? @ianmcook ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we could expose a copysign instead of sign_with_signed_zero, which would be more generic.
(sign_with_signed_zero(x) would simply be copysign(x, 1), AFAICT)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou The sign kernel is to be used by the rounding function. I agree with the fact that sign_with_signed_zero is equivalent to copysign. Actually, copysign is the variant needed for rounding function.

Nevertheless, sign is a common function exposed in frontend tools, so I consider it worthwhile to provide.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I consider that copysign should not be exposed as a compute function but rather used internally, and instead we should expose sign and sign_with_signed_zero (implementations may use copysign(x, 1). The rationale for this is that copysign(x, 1) is not common in any SQL, dataframes library but sign is.

A use case of sign is to build a custom ranking based on the signedness of values.
For example, consider ranking values relative to a constant value (e.g., 3):

SELECT CASE 
    WHEN @a < 3 THEN 0
    WHEN @a = 3 THEN 1
    WHEN @a > 3 THEN 2
END

is not supported because expressions are not allowed in CASE-WHEN constructs, so a workaround is

SELECT CASE SIGN(@a - 3)
    WHEN -1 THEN 0
    WHEN 0 THEN 1
    WHEN 1 THEN 2
END

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can we answer the original question? Why are two variants needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After careful thought, there is not a general need for having two variants. The only case where I see that sign_with_signed_zero can be relevant is if data stored is used for calculus-related operations (e.g., integral limits).
I will remove the sign_with_signed_zero variant.

@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch from 8a1acf7 to 1e29103 Compare July 14, 2021 14:35
@edponce edponce requested review from bkietz and pitrou July 15, 2021 15:59
@edponce edponce force-pushed the ARROW-12861-Compute-Add-sign-function-kernels branch from 2d8bab5 to ea19dc6 Compare July 15, 2021 16:28
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants