Skip to content

Conversation

@rishvin
Copy link
Contributor

@rishvin rishvin commented Jun 9, 2025

Which issue does this PR close?

Rationale for this change

Please see here.

What changes are included in this PR?

  • Added support for Int32 type for sha2 bit_length argument.
  • Refactored the code base to prevent duplicates.
  • Fixed SparkSha2 response to return hex output in lowercase.
  • Fixed SparkSha2 to not throw if the bit-length is not supported and instead return NULL.
  • Updated test cases in hash/sha2.slt accordingly.

Are these changes tested?

Tested the following ways,

Are there any user-facing changes?

Yes, the sha2 function can be specified in the SQL.

The SHA2 output changes in 2 ways with this change,

  • SHA2 hex output is lowercase instead of uppercase.
  • SHA2 doesn't throw error messages for unsupported bit-type instead returns NULL. This is in-accordance with Spark behaviour.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Jun 9, 2025
@rishvin
Copy link
Contributor Author

rishvin commented Jun 9, 2025

@andygrove here are the SHA2 changes to be compliant with Spark.

@andygrove
Copy link
Member

@getChan fyi

let hexed: StringArray = array
.iter()
.map(|v| v.map(hex_bytes).transpose())
.map(|v| v.map(|b| hex_bytes(b, true)).transpose())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@getChan getChan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contribution.
However, the hex function shuold not be modified.

let hexed: StringArray = array
.iter()
.map(|v| v.map(hex_bytes).transpose())
.map(|v| v.map(|b| hex_bytes(b, true)).transpose())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Spark's hex function is expected to return uppercase characters.
https://spark.apache.org/docs/latest/api/sql/index.html#hex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @getChan for pointing this out. In that case, I will make change for SHA2 hex representation only. Here is one example from spark-shell of SHA2,

scala> spark.sql("SELECT sha2('abc', 256) AS sha256_value").show(false)
+----------------------------------------------------------------+
|sha256_value                                                    |
+----------------------------------------------------------------+
|ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad|
+----------------------------------------------------------------+

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to the hex function did not cause any test failures ... does this mean we are missing some tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have hex related test cases in Comet but they seems to be passing.
Looks like this is happening because Comet has its own hex-function, for which we call hex_bytes. But for hex-bytes lower case is set to false. Hence, no failure.

So, looks like the Datafusion's hex functionality is not getting leveraged and Comet is handling this on its own, hence we did not see regression on the Comet test-suite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant that we should have seen test failures in DataFusion after changing the hex behavior. It looks like we should add an slt test for hex as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have test in Datafusion also datafusion/sqllogictest/test_files/spark/math/hex.slt but my PR changes the cases to lower. I will have to revert them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I totally missed that when reviewing.

@rishvin rishvin requested review from andygrove and getChan June 11, 2025 04:50
@rishvin
Copy link
Contributor Author

rishvin commented Jun 11, 2025

Thanks @andygrove / @getChan for the feedback. Please review the changes.

@alamb
Copy link
Contributor

alamb commented Jun 11, 2025

FYI @shehabgamin

Copy link
Contributor

@getChan getChan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring, improvement

Copy link
Contributor

@shehabgamin shehabgamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SparkSha2 is not compliant with Spark and does not support Int32 type

5 participants