Skip to content

HAVING doesn't work with ORDER BY #10013

@cseidman

Description

@cseidman

Describe the bug

I found a case where a query fails when there is both a HAVING clause and an ORDER BY clause. The query will work if it has either the HAVING clause or the ORDER BY clause .. but not both.

Here is a failing query:

with revs
as (
    SELECT
        customer,
        cast(round(CAST(substring(revenue,2) AS float)) as INT) AS visit_revenue
    FROM visits
)
SELECT
        customer,
        SUM(visit_revenue) as total_revenue
FROM revs
group by customer
having SUM(visit_revenue) > 999
order by SUM(visit_revenue)

The error I get is:

Error: SchemaError(FieldNotFound { field: Column { relation: None, name: "visit_revenue" }, valid_fields: [Column { relation: Some(Bare { table: "revs" }), name: "customer" }, Column { relation: None, name: "total_revenue" }, Column { relation: Some(Bare { table: "revs" }), name: "customer" }, Column { relation: None, name: "SUM(revs.visit_revenue)" }] }, Some(""))

If I comment out either the line with having SUM(visit_revenue) > 999 OR order by SUM(visit_revenue) the the query will succeed.

To Reproduce

To reproduce, run this program with the attached CSV file as data.

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let file_name = "data/test.csv";

    let schema = Schema::new(vec![
        Field::new("date", DataType::Utf8, false),
        Field::new("weekday", DataType::Utf8, false),
        Field::new("time", DataType::Utf8, false),
        Field::new("customer", DataType::Utf8, false),
        Field::new("revenue",  DataType::Utf8, false),
    ]);

    let options = CsvReadOptions::new()
        .schema(&schema)
        .delimiter(b'\t')
        .has_header(true)
        ;

    let ctx = SessionContext::new();
    ctx.register_csv("visits", file_name, options).await?;

    let df = ctx.sql(r#"
                with revs
                as (
                    SELECT
                        customer,
                        cast(round(CAST(substring(revenue,2) AS float)) as INT) AS visit_revenue
                    FROM visits
                )
                SELECT
                        customer,
                        SUM(visit_revenue) as total_revenue
                FROM revs
                group by customer
                having SUM(visit_revenue) > 999
                order by SUM(visit_revenue)
                "#).await?;

    df.show().await?;
    Ok(())
}

The attached file contains the data:
test.csv

Expected behavior

The expectation is that the result set is both filtered for records where the sum of visit_revenue > 999 AND ordered by the same without crashing.

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions