Skip to content

Handle the Count() bug in correlated scalar subquery #6497

@mingmwang

Description

@mingmwang

Describe the bug

When the scalar subquery contains aggregation, our current rewriting strategy(Kim' method) is do the aggregation and then join(Left out join), this will lead the Count bug.

Need to handle the unmatched rows and construct the computing Project

To Reproduce

No response

Expected behavior

No response

Additional context

Paper: Orthogonal optimization of subqueries and aggregation

Aggregate, then join.

This execution strategy was original proposed by Kim [11]. It is possible to aggregate directly over the orders table to obtain the total sales per customer and later join with the customer table. This alse allows pushing the aggregate condition below the join. The SQL formulation uses a derived table, not a subquery.

select c custkey
from customer,
(select o custkey from orders
group by c custkey
having 1000000 < sum(o totalprice))
as AggResult
where o custkey = c custkey

Section 3.2 Moving GroupBy around an outerjoin

In the result of an outerjoin, an unmatched row appears exactly once. Therefore given that our grouping columns include the key of the outer relation, a group that has an unmatched row cannot have any other row. The aggregate
functions use only the columns from the non-outer relation. For an unmatched row all these columns are NULL. Therefore the property of aggregate expressions that is important to us is the result of applying it to NULLs. If the result is
NULL as it is for most simple aggregate expressions, we need do nothing more. The outerjoin will automatically provide
the NULLs we need. For the aggregate expressions which do not result in a NULL, we need to add a project which
for each unmatched row sets the aggregate result to the appropriate constant value.

Note that this constant can be calculated at compile time. For count(*), the value of this constant is zero.

Formally we have

GA,F (S LOJpR) = πc(S LOJp(GA−columns(S),F R))

where the computing project πc introduces the non-NULL results if necessary.
No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions