Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 29, 2025

Which issue does this PR close?

Rationale for this change

@mhilton's change to date_trunc to make it correct also potentially slows it down

Let's try and recover the performance (and then some) so DataFusion 51 can be both more correct and faster

What changes are included in this PR?

Use try_unary_op methods to avoid allocating intermediate arrays to improve performance

Are these changes tested?

functionally by CI

I will run benchmarks on this PR as well

Are there any user-facing changes?

mhilton and others added 2 commits October 29, 2025 15:59
The array-based implementation of date_trunc can produce incorrect
results for negative timestamps (i.e. dates before 1970-01-01). Check
for any such incorrect values and compensate accordingly.
@alamb alamb added the performance Make DataFusion faster label Oct 29, 2025
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Oct 29, 2025
@alamb alamb changed the title Alamb/issue 18344 opt Optimize version of date_trunc Oct 29, 2025
@alamb alamb changed the title Optimize version of date_trunc Optimize date_trunc more Oct 29, 2025
(Nanosecond, "minute") => Some(Int64Array::new_scalar(60_000_000_000)),
(Nanosecond, "hour") => Some(Int64Array::new_scalar(3_600_000_000_000)),
(Nanosecond, "day") => Some(Int64Array::new_scalar(86_400_000_000_000)),
let unit: Option<i64> = match (tu, granularity) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key idea here is to make this code faster by reusing the allocation and operating in place rather than allocating new arrays

@alamb alamb changed the title Optimize date_trunc more Optimize date_trunc function by avoiding allocations Oct 29, 2025
i.checked_div(unit)
.ok_or_else(|| exec_datafusion_err!("division overflow"))
})?;
let array = try_unary_mut_or_clone(array, |i| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically speaking, only the first try_unary_mut_or_clone is needed
on the second transformation, we're guaranteed to be the pointer into the array, and the so taking the or_clone path would be an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, though I don't know how to represent this in code.

Maybe I could make a second function try_unary_mut_or_error that throws a runtime error 🤔

Comment on lines +519 to +524
fn try_unary_mut_or_clone<F>(
array: PrimitiveArray<Int64Type>,
op: F,
) -> Result<PrimitiveArray<Int64Type>>
where
F: Fn(i64) -> Result<i64>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really date_trunc specific. can this be made more flexible with a more generous use of generics?
perhaps it could even be in arrow-rs. it makes try_unary_mut significantly more approachable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree -- the try_unary_mut is quite awkward to use. I will see if I can port some of these changes upstream / see what they look like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/issue-18344-opt (dc2b5f7) to 6cc73fa diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_issue-18344-opt
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2025

🤖: Benchmark completed

Details

group                     alamb_issue-18344-opt                  main
-----                     ---------------------                  ----
date_trunc_minute_1000    1.10      5.2±0.02µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec

@alamb
Copy link
Contributor Author

alamb commented Nov 7, 2025

We went with a dfferent approach , and I have filed the follow on ticket

@alamb alamb closed this Nov 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants