-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Optimize date_trunc function by avoiding allocations
#18360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The array-based implementation of date_trunc can produce incorrect results for negative timestamps (i.e. dates before 1970-01-01). Check for any such incorrect values and compensate accordingly.
| (Nanosecond, "minute") => Some(Int64Array::new_scalar(60_000_000_000)), | ||
| (Nanosecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000_000)), | ||
| (Nanosecond, "day") => Some(Int64Array::new_scalar(86_400_000_000_000)), | ||
| let unit: Option<i64> = match (tu, granularity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key idea here is to make this code faster by reusing the allocation and operating in place rather than allocating new arrays
date_trunc moredate_trunc function by avoiding allocations
| i.checked_div(unit) | ||
| .ok_or_else(|| exec_datafusion_err!("division overflow")) | ||
| })?; | ||
| let array = try_unary_mut_or_clone(array, |i| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically speaking, only the first try_unary_mut_or_clone is needed
on the second transformation, we're guaranteed to be the pointer into the array, and the so taking the or_clone path would be an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, though I don't know how to represent this in code.
Maybe I could make a second function try_unary_mut_or_error that throws a runtime error 🤔
| fn try_unary_mut_or_clone<F>( | ||
| array: PrimitiveArray<Int64Type>, | ||
| op: F, | ||
| ) -> Result<PrimitiveArray<Int64Type>> | ||
| where | ||
| F: Fn(i64) -> Result<i64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really date_trunc specific. can this be made more flexible with a more generous use of generics?
perhaps it could even be in arrow-rs. it makes try_unary_mut significantly more approachable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree -- the try_unary_mut is quite awkward to use. I will see if I can port some of these changes upstream / see what they look like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I filed Hard to use
PrimitiveArray::unary_mut,PrimitiveArray:try_unary_mut, etc arrow-rs#8808 to track this idea
|
🤖 |
|
🤖: Benchmark completed Details
|
|
We went with a dfferent approach , and I have filed the follow on ticket |
Which issue does this PR close?
DATE_TRUNCexpression regression for values before epoch #18334Rationale for this change
@mhilton's change to
date_truncto make it correct also potentially slows it downLet's try and recover the performance (and then some) so DataFusion 51 can be both more correct and faster
What changes are included in this PR?
Use
try_unary_opmethods to avoid allocating intermediate arrays to improve performanceAre these changes tested?
functionally by CI
I will run benchmarks on this PR as well
Are there any user-facing changes?