test(native): Add comprehensive tests for Iceberg bucket function#26581
test(native): Add comprehensive tests for Iceberg bucket function#26581Joe-Abraham wants to merge 1 commit into
Conversation
41d5180 to
fe7c2b2
Compare
Reviewer's GuideAdjusts the Iceberg Class diagram for updated IcebergBucketFunction scalar bucket overloadsclassDiagram
class IcebergBucketFunction {
<<final>>
- IcebergBucketFunction()
+ bucketInteger(numberOfBuckets: int, value: long) long
+ bucketVarchar(numberOfBuckets: int, value: Slice) long
+ bucketVarbinary(numberOfBuckets: int, value: Slice) long
+ bucketDate(numberOfBuckets: int, value: long) long
+ bucketTimestamp(numberOfBuckets: int, value: long) long
+ bucketTimestampWithTimeZone(numberOfBuckets: int, value: long) long
}
class Bucket {
<<static>>
+ bucketShortDecimal(numPrecision: long, numScale: long, numberOfBuckets: int, value: long) long
+ bucketLongDecimal(numPrecision: long, numScale: long, numberOfBuckets: int, value: Slice) long
}
IcebergBucketFunction *-- Bucket
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes and found some issues that need to be addressed.
- The updated test calls for
bucketShortDecimalandbucketLongDecimalappear to have precision/scale parameters swapped (e.g.,bucketShortDecimal(12, 5, 12, 45643)forDECIMAL(5,2)), which no longer match theDECIMAL(p, s)types used in the SQL and likely need to be reordered to(p, s, numberOfBuckets, value). - Now that the SQL type of the bucket functions is
INTEGER, consider changing the Java return type toint(or avoiding unnecessary(long)casts) to better reflect the underlying type and avoid confusion about the actual range of values.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The updated test calls for `bucketShortDecimal` and `bucketLongDecimal` appear to have precision/scale parameters swapped (e.g., `bucketShortDecimal(12, 5, 12, 45643)` for `DECIMAL(5,2)`), which no longer match the `DECIMAL(p, s)` types used in the SQL and likely need to be reordered to `(p, s, numberOfBuckets, value)`.
- Now that the SQL type of the bucket functions is `INTEGER`, consider changing the Java return type to `int` (or avoiding unnecessary `(long)` casts) to better reflect the underlying type and avoid confusion about the actual range of values.
## Individual Comments
### Comment 1
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/function/IcebergBucketFunction.java:38` </location>
<code_context>
- @SqlType(StandardTypes.BIGINT)
- public static long bucketInteger(@SqlType(StandardTypes.BIGINT) long value, @SqlType(StandardTypes.INTEGER) long numberOfBuckets)
+ @SqlType(StandardTypes.INTEGER)
+ public static long bucketInteger(@SqlType(StandardTypes.INTEGER) long numberOfBuckets, @SqlType(StandardTypes.BIGINT) long value)
{
- return Transforms.bucket((int) numberOfBuckets)
</code_context>
<issue_to_address>
**issue (bug_risk):** Swapping parameter order is a breaking change for existing SQL calls to `bucket`.
Changing the signature from `bucket(value, numberOfBuckets)` to `bucket(numberOfBuckets, value)` plus the new `@SqlType(StandardTypes.INTEGER)` on the first parameter makes the SQL signature `(integer, <type>)` instead of `(<type>, integer)`. Existing queries like `bucket(col, 16)` will no longer resolve because that overload no longer exists. If this break is intentional, it needs to be treated and communicated as a SQL-level breaking change; otherwise, keep the original parameter order and adjust only the return type if needed.
</issue_to_address>
### Comment 2
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/function/IcebergBucketFunction.java:37-46` </location>
<code_context>
@ScalarFunction("bucket")
- @SqlType(StandardTypes.BIGINT)
- public static long bucketVarchar(@SqlType(StandardTypes.VARCHAR) Slice value, @SqlType(StandardTypes.INTEGER) long numberOfBuckets)
+ @SqlType(StandardTypes.INTEGER)
+ public static long bucketVarchar(@SqlType(StandardTypes.INTEGER) long numberOfBuckets, @SqlType(StandardTypes.VARCHAR) Slice value)
{
return (long) Transforms.bucket((int) numberOfBuckets)
</code_context>
<issue_to_address>
**issue (bug_risk):** Changing the SQL return type from BIGINT to INTEGER alters the observable contract of `bucket`.
This SQL-level type change can break existing queries, views, or integrations that rely on `bucket` returning BIGINT (e.g., joins, casts, or UDFs typed as BIGINT). If the goal is to match Iceberg’s bucket spec, we still need a clear compatibility/migration plan; otherwise, keep the SQL return type as BIGINT and only narrow it where we can guarantee safety.
</issue_to_address>
### Comment 3
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergScalarFunctions.java:70` </location>
<code_context>
+ functionAssertions.assertFunction(catalogSchema + ".bucket(4, cast(1950 as smallint))", INTEGER, (int) bucketInteger(4, 1950));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(5, cast(2375645 as int))", INTEGER, (int) bucketInteger(5, 2375645));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(6, cast(2779099983928392323 as BIGINT))", INTEGER, (int) bucketInteger(6, 2779099983928392323L));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(12, cast(456.43 as DECIMAL(5,2)))", INTEGER, (int) bucketShortDecimal(12, 5, 12, 45643));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(12, cast('12345678901234567890.1234567890' as DECIMAL(30,10)))", INTEGER, (int) bucketLongDecimal(10, 30, 12, encodeScaledValue(new BigDecimal("12345678901234567890.1234567890"))));
</code_context>
<issue_to_address>
**issue (testing):** Decimal test passes bucket and precision/scale arguments to `bucketShortDecimal` in the wrong order, so it no longer matches the SQL call being tested.
`bucketShortDecimal` is now `(numPrecision, numScale, numberOfBuckets, value)`, but the test calls `bucketShortDecimal(12, 5, 12, 45643)`. For `DECIMAL(5,2)` with `numberOfBuckets = 12`, this should be `(5, 2, 12, 45643)`. As written, the test asserts against the wrong configuration. Please update the helper call to use the correct argument order and values so it matches the SQL expression.
</issue_to_address>
### Comment 4
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergScalarFunctions.java:71` </location>
<code_context>
+ functionAssertions.assertFunction(catalogSchema + ".bucket(5, cast(2375645 as int))", INTEGER, (int) bucketInteger(5, 2375645));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(6, cast(2779099983928392323 as BIGINT))", INTEGER, (int) bucketInteger(6, 2779099983928392323L));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(12, cast(456.43 as DECIMAL(5,2)))", INTEGER, (int) bucketShortDecimal(12, 5, 12, 45643));
+ functionAssertions.assertFunction(catalogSchema + ".bucket(12, cast('12345678901234567890.1234567890' as DECIMAL(30,10)))", INTEGER, (int) bucketLongDecimal(10, 30, 12, encodeScaledValue(new BigDecimal("12345678901234567890.1234567890"))));
- functionAssertions.assertFunction(catalogSchema + ".bucket(cast('nasdbsdnsdms' as varchar), 7)", BIGINT, bucketVarchar(utf8Slice("nasdbsdnsdms"), 7));
</code_context>
<issue_to_address>
**issue (testing):** Long-decimal test also swaps precision and scale when calling `bucketLongDecimal`, so the expected value may not reflect the actual SQL behavior.
In the `DECIMAL(30,10)` test, `bucketLongDecimal` expects `(numPrecision, numScale, numberOfBuckets, value)`, but it’s called as `bucketLongDecimal(10, 30, 12, ...)` while the SQL uses `DECIMAL(30,10)`. This means precision and scale are swapped, so the test no longer matches the SQL contract for `bucket`. Please correct the call to `bucketLongDecimal(30, 10, 12, ...)` so the helper aligns with the SQL expression.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| @SqlType(StandardTypes.BIGINT) | ||
| public static long bucketInteger(@SqlType(StandardTypes.BIGINT) long value, @SqlType(StandardTypes.INTEGER) long numberOfBuckets) | ||
| @SqlType(StandardTypes.INTEGER) | ||
| public static long bucketInteger(@SqlType(StandardTypes.INTEGER) long numberOfBuckets, @SqlType(StandardTypes.BIGINT) long value) |
There was a problem hiding this comment.
issue (bug_risk): Swapping parameter order is a breaking change for existing SQL calls to bucket.
Changing the signature from bucket(value, numberOfBuckets) to bucket(numberOfBuckets, value) plus the new @SqlType(StandardTypes.INTEGER) on the first parameter makes the SQL signature (integer, <type>) instead of (<type>, integer). Existing queries like bucket(col, 16) will no longer resolve because that overload no longer exists. If this break is intentional, it needs to be treated and communicated as a SQL-level breaking change; otherwise, keep the original parameter order and adjust only the return type if needed.
There was a problem hiding this comment.
I agree with sourcery-ai. This is a breaking change.
The existing supported DDL when creating partition table is
partitioning=ARRAY['bucket(c_int,10)', 'bucket(c_bigint,13)'])
I think we should follow this pattern.
hantangwangd
left a comment
There was a problem hiding this comment.
@Joe-Abraham thanks for this change. Following Iceberg's specifications and Spark's syntax, I agree with you that using bucket(n, value) is the correct way because, as defined by specification, n is used to parameterize the bucket transform as bucket_n, rather than simply serving as a function argument. But Trino appears to use truncate(value, n).
Besides, considering our existing syntax for specifying partition transforms when creating an Iceberg table—bucket(value, n)/truncate(value, n)—I'm wondering whether we should apply the same adjustment as well. This would be a breaking change from a user-facing SQL syntax perspective. What are your thoughts on this? @tdcmeehan @Joe-Abraham
| import static com.facebook.presto.nativeworker.PrestoNativeQueryRunnerUtils.javaIcebergQueryRunnerBuilder; | ||
| import static com.facebook.presto.nativeworker.PrestoNativeQueryRunnerUtils.nativeIcebergQueryRunnerBuilder; | ||
|
|
||
| public class TestBucketFunction |
There was a problem hiding this comment.
What's the purpose of these testcases?
It seems it just verifying the existence of iceberg.system.bucket, should we verify the bucket result too?
When running query with native query runner, which bucket function is been called, the Java version or native version?
| @SqlType(StandardTypes.BIGINT) | ||
| public static long bucketInteger(@SqlType(StandardTypes.BIGINT) long value, @SqlType(StandardTypes.INTEGER) long numberOfBuckets) | ||
| @SqlType(StandardTypes.INTEGER) | ||
| public static long bucketInteger(@SqlType(StandardTypes.INTEGER) long numberOfBuckets, @SqlType(StandardTypes.BIGINT) long value) |
There was a problem hiding this comment.
I agree with sourcery-ai. This is a breaking change.
The existing supported DDL when creating partition table is
partitioning=ARRAY['bucket(c_int,10)', 'bucket(c_bigint,13)'])
I think we should follow this pattern.
Description
Motivation and Context
#26558
Contributor checklist
Release Notes