Skip to content

Bad Join Order for TPCH Q18 results in slow performance #7950

@alamb

Description

@alamb

Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take much longer to execute the query

To Reproduce

Create Data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	c_name,	c_custkey,	o_orderkey,	o_orderdate,	o_totalprice,	sum(l_quantity)	from	customer,	orders,	lineitem	where	o_orderkey in (	select	l_orderkey	from	lineitem	group by	l_orderkey having	sum(l_quantity) > 300	)	and c_custkey = o_custkey	and o_orderkey = l_orderkey	group by	c_name,	c_custkey,	o_orderkey,	o_orderdate,	o_totalprice	order by	o_totalprice desc,	o_orderdate;"

Here is the query:

select
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
sum(l_quantity)
from
customer,
orders,
lineitem
where	o_orderkey in (
  select l_orderkey
  from	lineitem
  group by l_orderkey
  having	sum(l_quantity) > 300
  )
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate;

Expected behavior

The input order of HashJoin(1) in the below plan should be swapped

         DataFusion TPCH SF10 Q18 Plan                                                                    
        annotated with output row counts                                                                  
                                                                                                          
                                                         1 Row             This join is building a 60M row
                                                                           hash table and probing 624(!!) 
                                                  ┌────────────────┐        rows, rather than building a  
                                                  │   Aggregate    │       624 row hash table and probing 
                                                  │                │               a 60M row table        
                                                  │                │                                      
                                                  └────────────────┘                                      
                                                           │                              │               
                                                           ▼    2M Rows                                   
                                                  ┌────────────────┐                      │               
                                                  │  HashJoin (1)  │                                      
                                                  │   o_orderkey = │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘               
                                                  │   l_orderkey   │                                      
                                                  └────┬──────┬────┘                                      
                                                       │      │         624 Rows                          
                                        ┌──────────────┘      └────────────────────────┐                  
                          60M Rows      │                                              │                  
                                        │                                              ▼                  
                                        ▼                            2M Rows  ┌────────────────┐          
                               ┌────────────────┐                             │     Filter     │          
                               │  HashJoin (2)  │                             │                │          
                               │   o_orderkey = │                             │                │          
                               │   l_orderkey   │                             └────────────────┘          
                               └───┬────────┬───┘     60M Rows                         │                  
                                   │        │                                          │                  
              15M Rows    ┌────────┘        └───────────────┐                          │  15M Rows        
                          │                                 │                          │                  
                          │                                 │                          │                  
                          ▼                                 ▼                          ▼                  
                 ┌────────────────┐               ┌──────────────────┐        ┌────────────────┐          
                 │  HashJoin (3)  │               │Scan: lineitem    │        │   Aggregate    │          
                 │   c_custkey =  │               │(no filters)      │        │                │          
                 │   o_custkey    │               │                  │        │                │          
                 └───┬────────┬───┘               └──────────────────┘        └────────────────┘          
  1.5M Rows          │        │       15M Rows                                         │                  
           ┌─────────┘        └──────────┐                                             │   60M Rows       
           │                             │                                             │                  
           ▼                             ▼                                             ▼                  
 ┌──────────────────┐          ┌──────────────────┐                          ┌──────────────────┐         
 │Scan: customer    │          │Scan: orders      │                          │Scan: lineitem    │         
 │(no predicates)   │          │(no predicates)   │                          │                  │         
 │                  │          │                  │                          │                  │         
 └──────────────────┘          └──────────────────┘                          └──────────────────┘         

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions