-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Describe the bug
Joins where the ON filter are not equality, but rather inequalities like <, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor".
The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a NestedLoopJoinExec, while DuckDB uses a IEJoin.
Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue #318).
To Reproduce
Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:
CREATE
OR REPLACE TABLE pricing AS
SELECT
t,
RANDOM() as v
FROM
range(
'2022-01-01' :: TIMESTAMP,
'2023-01-01' :: TIMESTAMP,
INTERVAL 30 DAY
) ts(t);
COPY pricing to 'pricing.parquet' (format 'parquet');
CREATE
OR REPLACE TABLE timestamps AS
SELECT
t
FROM
range(
'2022-01-01' :: TIMESTAMP,
'2023-01-01' :: TIMESTAMP,
INTERVAL 10 SECOND
) ts(t);
COPY timestamps to 'timestamps.parquet' (format 'parquet');$ duckdb < repro-dataset.sqlWe will compare the performance of the following query in DuckDB and Datafusion. The query is saved as repro-range-query.sql.
WITH pricing_state AS (
SELECT
t as valid_from,
COALESCE(
LEAD(t, 1) OVER (
ORDER BY
t
),
'9999-12-31'
) as valid_to,
v
FROM
'pricing.parquet'
)
SELECT
t.t,
p.v
FROM
pricing_state p
LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
AND p.valid_to;DuckDB performance:
$ time duckdb < repro-range-query.sql
...
real 0m0.999s
user 0m6.070s
sys 0m3.600sDatafusion performance:
$ time datafusion-cli -f repro-range-query.sql
...
real 0m8.269s
user 0m6.358s
sys 0m1.907sExpected behavior
It would be nice if the above query (or something equivalent) would be faster in Datafusion.
If someone knows of a better way to express the query, then that could also be a workaround for me.
Additional context
Machine tested on:
CPU:Ryzen 3900x
OS: Ubuntu 22.04
Versions used:
$ duckdb --version
v0.9.2 3c695d7ba9$ datafusion-cli --version
datafusion-cli 33.0.0