
Conversation

@rohitrastogi (Contributor) commented May 7, 2024

Which issue does this PR close?

Closes #350

Rationale for this change

Improve compatibility with Spark

What changes are included in this PR?

  1. Functionality to cast floats, doubles, and decimals to bytes, shorts, ints, and longs in a Spark-compatible way (a sketch of the intended semantics follows this list)
  2. Enhanced decimal tests to check for cast edge cases
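
As a rough illustration of the semantics involved (an editor's sketch, not the Comet implementation; the function name and error text are made up): in Spark's legacy mode a float-to-int cast follows JVM f2i conversion, so NaN becomes 0 and out-of-range values saturate, while ANSI mode raises an overflow error instead.

fn cast_f32_to_i32(value: f32, ansi_mode: bool) -> Result<i32, String> {
    // Two-sided range check performed in f64, where the i32 bounds are exact.
    let overflows = value.is_nan()
        || (value as f64).floor() > (i32::MAX as f64)
        || (value as f64).ceil() < (i32::MIN as f64);
    if ansi_mode && overflows {
        // Placeholder message; Spark's actual ANSI error text differs.
        return Err(format!("CAST_OVERFLOW: {value} cannot be cast to INT"));
    }
    // Rust's `as` cast matches JVM f2i semantics: NaN -> 0, saturating at the bounds.
    Ok(value as i32)
}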

How are these changes tested?

CometCastSuite tests pass.

@rohitrastogi marked this pull request as draft May 7, 2024 20:44
@rohitrastogi marked this pull request as ready for review May 7, 2024 21:18
@andygrove (Member) commented:
Thank you @rohitrastogi, this looks really great! Could you also run mvn package -DskipTests to regenerate the compatibility.md file so it includes the newly supported cast expressions?

@rohitrastogi (Contributor, Author) commented:

I ran mvn package -DskipTests to regenerate compatibility.md.

@rohitrastogi requested a review from andygrove May 8, 2024 05:12
@andygrove (Member) left a comment:

Thank you @rohitrastogi for this contribution! LGTM.

@andygrove merged commit c261af3 into apache:main May 8, 2024
.iter()
.map(|value| match value {
    Some(value) => {
        let is_overflow = value.is_nan() || value.abs() as i32 == std::i32::MAX;
@rohitrastogi (Contributor, Author) commented on May 10, 2024:

@andygrove This condition is actually incorrect due to how it handles INT_MIN.
It should be something like:

let is_overflow = value.is_nan() || (value as f64).floor() > (std::i32::MAX as f64) || (value as f64).ceil() < (std::i32::MIN as f64);

This is what Scala does in FloatExactNumeric.

I'm working on a fix with some improved tests. It looks like there are some tedious edge cases in how Java/Scala format the error strings, depending on how large the float is.

Rust and Scala print the same float with different decimal precision, which makes it challenging to produce the same error output as Spark in ANSI mode. I'm not sure how to address that - we may need to relax the exact string-match criterion for float -> int conversions and warn users that, although the error-checking logic in vanilla Spark and Comet is the same, the error messages differ.

For example, for INT_MAX cast to a float (which rounds to 2147483648.0), Rust prints 2.1474836E9 whereas Java prints 2.14748365E9. Both printouts correspond to the same float, 2147483648.
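
A minimal runnable sketch (an editor's illustration, not code from this PR) that demonstrates both points, the INT_MIN edge case and the formatting difference:

fn overflows_i32(value: f32) -> bool {
    value.is_nan()
        || (value as f64).floor() > (i32::MAX as f64)
        || (value as f64).ceil() < (i32::MIN as f64)
}

fn main() {
    // i32::MIN (-2^31) is exactly representable as an f32, so the corrected
    // check accepts it; the old abs()-based check rejected it because
    // `value.abs() as i32` saturates to i32::MAX.
    assert!(!overflows_i32(i32::MIN as f32));
    assert!(overflows_i32(f32::NAN));
    assert!(overflows_i32(2_147_483_648.0)); // 2^31 does not fit in an i32

    // Rust prints the shortest decimal string that round-trips, so the same
    // float comes out with fewer digits than Java's Float.toString produces.
    println!("{:E}", i32::MAX as f32); // prints 2.1474836E9 (Java: 2.14748365E9)
}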

.map(|value| match value {
    Some(value) => {
        let is_overflow =
            value.is_nan() || value.abs() as $rust_dest_type == $max_dest_val;
@rohitrastogi (Contributor, Author) commented on May 10, 2024:

This is also wrong.

Should be something like:

let is_overflow = value.is_nan() || (value as f64).floor() > ($max_dest_val as f64) || (value as f64).ceil() < ($min_dest_val as f64);
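
For reference, a hedged sketch of how the corrected two-sided check generalizes over the destination type; the macro name is made up, but the $min_dest_val/$max_dest_val parameters mirror the pattern above:

macro_rules! float_overflows {
    ($value:expr, $min_dest_val:expr, $max_dest_val:expr) => {
        $value.is_nan()
            || ($value as f64).floor() > ($max_dest_val as f64)
            || ($value as f64).ceil() < ($min_dest_val as f64)
    };
}

fn main() {
    let v: f32 = -129.0;
    assert!(float_overflows!(v, i8::MIN, i8::MAX)); // just below the i8 range
    assert!(!float_overflows!(v, i16::MIN, i16::MAX)); // fits in an i16
}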

Some(value) => {
    let divisor = 10_i128.pow($scale as u32);
    let (truncated, decimal) = (value / divisor, (value % divisor).abs());
    let is_overflow = truncated.abs() > $max_dest_val.into();
@rohitrastogi (Contributor, Author) commented:

Also wrong.

Should be:

let is_overflow = truncated > $max_dest_val.into() || truncated < $min_dest_val.into();
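
A small runnable sketch of the corrected decimal check (the function and its parameters are illustrative): truncate the unscaled value by the scale's divisor, then compare against both bounds:

fn decimal_overflows(unscaled: i128, scale: u32, min: i128, max: i128) -> bool {
    // i128 division truncates toward zero, dropping the fractional digits.
    let divisor = 10_i128.pow(scale);
    let truncated = unscaled / divisor;
    truncated > max || truncated < min
}

fn main() {
    // -128.45 stored as unscaled -12845 with scale 2 truncates to -128,
    // which fits exactly in an i8, so it must not be flagged.
    assert!(!decimal_overflows(-12_845, 2, i8::MIN as i128, i8::MAX as i128));
    // 128.45 truncates to 128, which is out of range for an i8.
    assert!(decimal_overflows(12_845, 2, i8::MIN as i128, i8::MAX as i128));
}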

Some(value) => {
    let divisor = 10_i128.pow($scale as u32);
    let (truncated, decimal) = (value / divisor, (value % divisor).abs());
    let is_overflow = truncated.abs() > std::i32::MAX.into();
@rohitrastogi (Contributor, Author) commented:

Also wrong.

Should be:

let is_overflow = truncated > std::i32::MAX.into() || truncated < std::i32::MIN.into();
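
The INT_MIN false positive can be shown directly (illustration only): the abs()-based check rejects a truncated value of exactly i32::MIN even though it fits, while the two-sided check accepts it:

fn main() {
    let truncated: i128 = i32::MIN as i128; // -2147483648 fits in an i32
    // Old check: |i32::MIN| = 2147483648 exceeds i32::MAX, a false positive.
    let old_check = truncated.abs() > i32::MAX as i128;
    // Corrected check: compare against both bounds without abs().
    let new_check = truncated > i32::MAX as i128 || truncated < i32::MIN as i128;
    assert!(old_check);
    assert!(!new_check);
}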

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
… to integral types (apache#399)

* WIP - float to int, sketchy

* WIP - extremely ugly but functional

* WIP - use macro

* simplify further

* delete dead code

* make format

* progress on decimals

* refactor

* format decimal value in overflow exception

* wip - have to use 4 macros, need more decimal tests

* ready for review

* forgot to commit whoops

* bad merge

* address pr comments

* commit missed compatibility

* improve error message

* improve error message again

* revert perf regression in cast_int_to_int_macro

* remove branching in loop for legacy case

---------

Co-authored-by: Rohit Rastogi <rohitrastogi@Rohits-MacBook-Pro.local>


Development

Successfully merging this pull request may close these issues.

Implement Spark-compatible CAST from float/double to integer types
