-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
The join order chosen for TPCH query 17 is bad making datafusion take much longer to execute the query
To Reproduce
Create Data:
cd arrow-datafusion/benchmarks
./bench.sh data tpch10Run query with datafusion-cli:
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) from customer, orders, lineitem where o_orderkey in ( select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 300 ) and c_custkey = o_custkey and o_orderkey = l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice order by o_totalprice desc, o_orderdate;"Here is the query:
select
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
sum(l_quantity)
from
customer,
orders,
lineitem
where o_orderkey in (
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 300
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate;Expected behavior
The input order of HashJoin(1) in the below plan should be swapped
DataFusion TPCH SF10 Q18 Plan
annotated with output row counts
1 Row This join is building a 60M row
hash table and probing 624(!!)
┌────────────────┐ rows, rather than building a
│ Aggregate │ 624 row hash table and probing
│ │ a 60M row table
│ │
└────────────────┘
│ │
▼ 2M Rows
┌────────────────┐ │
│ HashJoin (1) │
│ o_orderkey = │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
│ l_orderkey │
└────┬──────┬────┘
│ │ 624 Rows
┌──────────────┘ └────────────────────────┐
60M Rows │ │
│ ▼
▼ 2M Rows ┌────────────────┐
┌────────────────┐ │ Filter │
│ HashJoin (2) │ │ │
│ o_orderkey = │ │ │
│ l_orderkey │ └────────────────┘
└───┬────────┬───┘ 60M Rows │
│ │ │
15M Rows ┌────────┘ └───────────────┐ │ 15M Rows
│ │ │
│ │ │
▼ ▼ ▼
┌────────────────┐ ┌──────────────────┐ ┌────────────────┐
│ HashJoin (3) │ │Scan: lineitem │ │ Aggregate │
│ c_custkey = │ │(no filters) │ │ │
│ o_custkey │ │ │ │ │
└───┬────────┬───┘ └──────────────────┘ └────────────────┘
1.5M Rows │ │ 15M Rows │
┌─────────┘ └──────────┐ │ 60M Rows
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│Scan: customer │ │Scan: orders │ │Scan: lineitem │
│(no predicates) │ │(no predicates) │ │ │
│ │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Additional context
No response
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working