-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Decimal multiply kernel should not cause precision loss #5675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
There is more broader issue on type coercion. I'm working on this and will update. |
e5e3bcb to
b108159
Compare
b108159 to
9ba3285
Compare
|
rust-decimal only supports precision <= 28... |
| sum(l_extendedprice) as sum_base_price, | ||
| sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, | ||
| sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, | ||
| sum(cast(l_extendedprice as decimal(12,2)) * (1 - l_discount) * (1 + l_tax)) as sum_charge, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decimal calculation actually causes overflow on decimal multiplication because the kernel DataFusion uses doesn't allow precision loss. Previously it accidentally works because we implicitly truncate all input decimal value before multiplying...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I add a cast to get the test result unchanged. To get rid of this cast later, we can use new multiply kernel at arrow-rs which allows precision-loss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can add a comment here referencing the new arrow-rs kernel / work. Otherwise this context will likely get forgotten
| pub fn i128_to_str(value: i128, scale: u32) -> String { | ||
| pub fn i128_to_str(value: i128, precision: &u8, scale: &i8) -> String { | ||
| big_decimal_to_str( | ||
| BigDecimal::from_str(&Decimal::from_i128_with_scale(value, scale).to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rust-decimal doesn't allow precision > 28. We can simply use Decimal128Type from arrow-rs to get string of decimal.
32e5dfc to
e727f5a
Compare
| cast(cast(sum(case | ||
| when nation = 'BRAZIL' then volume | ||
| else 0 | ||
| end) / sum(volume) as mkt_share | ||
| end) as decimal(12,2)) / cast(sum(volume) as decimal(12,2)) as decimal(15,2)) as mkt_share |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current decimal divide kernel fails to compute this. The result decimal type is 38 scale. While divide_dyn_opt_decimal multiplies left array with 10.pow(scale), obviously it overflows because the multiply kernel doesn't allow precision-loss multiplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With inner casts wrapping two sum, we can use a result scale which won't cause overflow. The result decimal is correct.
But the answer file is fixed as decimal(15, 2) so the result still fails the check. I only can add another cast to make it as same decimal type to pass this result check.
b1292ba to
fb9a27a
Compare
|
More changes than I expected, but finally making DataFusion all tests happy... |
|
I'll try and find time to review this PR tomorrow |
|
cc @liukun4515 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya
I took a look through this code -- the coercion rule changes make sense to me. However, I don't undertand the need for Expr::PromotePrecision or the new data_type field on Expr. They don't seem to be related to improving the coercion for decimals
| } | ||
| } | ||
|
|
||
| /// Cast expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add documentation about what PromotePrecision does and in what circumstances it is needed?
| sum(l_extendedprice) as sum_base_price, | ||
| sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, | ||
| sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, | ||
| sum(cast(l_extendedprice as decimal(12,2)) * (1 - l_discount) * (1 + l_tax)) as sum_charge, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can add a comment here referencing the new arrow-rs kernel / work. Otherwise this context will likely get forgotten
| /// cast to a type with respect to a schema | ||
| fn cast_to<S: ExprSchema>(self, cast_to_type: &DataType, schema: &S) -> Result<Expr>; | ||
|
|
||
| /// promote to a type with respect to a schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add some comments to clarify what the difference between "promote" and "cast" is?
| pub op: Operator, | ||
| /// Right-hand side of the expression | ||
| pub right: Box<Expr>, | ||
| /// The data type of the expression, if known |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the reason for this change -- is it no longer possible to calculate the output type of a BinaryExpr from its inputs?
If the desired output type is different then couldn't we cast the expr to the type?
The coercion rule that modifies sides of arithmetic op is not idempotent. Multiple runs of the rule will change it to incorrect result. So we need something to prevent the rule on coerced sides. For the new |
This makes sense -- thank you My biggest concern with the approach in this PR is that it adds some special case for decimal (which seems reasonable in itself) but that special case bleeds out over the rest of the code / plans, which is probably why it got so big. I wonder would it be possible to use Perhaps something like the following. /// The target decimal type the expression should be output to
struct PromotePrecision {
expr: Box<Expr>,
precision: u8,
scale: u8,
}One of the downsides of this approach is that PromotePrecision is now clearly special purpose for Decimal. I am not sure if it would have other uses in the future 🤔 |
|
Keeping precision/scale in |
I think then it would be best to keep the data type in |
Thanks for your mention I will take a look it carefully tomorrow. @jackwener has moved the type coercion to the analysis phase. Maybe we have some same comments about the rule of type coercion. |
|
Yea, seems I need to rebase this. |
|
Does this new |
It is for Decimal type only. Yea, the idea is from Spark. It was removed in latest Spark as the precision promotion is moved to where the computation happens. I was not sure if moving it into decimal kernels is a good idea for DataFusion. Let me re-think about it and maybe refactor this. |
|
marking as draft while it gets worked on |
|
Thanks @alamb. I'll find some time on this. :) |
|
This new approach is proposed at #5980. |
Which issue does this PR close?
Closes #5674.
Rationale for this change
Currently decimal multiplication in DataFusion silently truncates precision of result. It happens generally for regular decimal multiplication which doesn't overflow. Looks like DataFusion uses incomplete decimal precision coercion rule from Spark to coerce sides of decimal multiplication (and other arithmetic operators). The coerced type on two sides of decimal multiplication is not the resulting decimal type of multiplication. This (and how we computes decimal multiplication in the kernels) leads to truncated precision in the result decimal type.
What changes are included in this PR?
This patch updates type coercion rule for decimal arithmetic operators. To prevent the type coercion rule applied on sides of decimal operators, we need one expression node to wrap coerced sides.
Are these changes tested?
Are there any user-facing changes?