Skip to content

ClickBench extended queries are not working - WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont) #15927

@alamb

Description

@alamb

Describe the bug

https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries

The "extended" queries are not part of the official ClickBench benchmark. Instead they are used to test other DataFusion features that are not covered by the standard benchmark.

Recently I tried to run Q5 for benchmarking and I got an error:

Error during planning: WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont)

To Reproduce

Run this query in datafusion-cli

SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, MAX("ResponseStartTiming") tmax
FROM 'hits.parquet'
WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
GROUP BY  "ClientIP", "WatchID"
HAVING c > 1
ORDER BY tp95 DESC
LIMIT 10;

For example:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ~/Software/datafusion/target/debug/datafusion-cli
DataFusion CLI v47.0.0
> SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, MAX("ResponseStartTiming") tmax
FROM 'hits.parquet'
WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
GROUP BY  "ClientIP", "WatchID"
HAVING c > 1
ORDER BY tp95 DESC
LIMIT 10;
Error during planning: WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont)

Expected behavior

DF 47 runs the query

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ~/Software/datafusion-cli/datafusion-cli-47.0.0
DataFusion CLI v47.0.0
> SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, MAX("ResponseStartTiming") tmax
FROM 'hits.parquet'
WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
GROUP BY  "ClientIP", "WatchID"
HAVING c > 1
ORDER BY tp95 DESC
LIMIT 10;
+-------------+---------------------+---+------+------+------+
| ClientIP    | WatchID             | c | tmin | tp95 | tmax |
+-------------+---------------------+---+------+------+------+
| 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
| -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
+-------------+---------------------+---+------+------+------+
2 row(s) fetched.
Elapsed 5.360 seconds.

Additional context

However, it seems like postgres also rejects such queries without WIHINGROUP:

See this dbfiddle: https://www.db-fiddle.com/f/5dwiFr16TvBF8zF6f2TmSz/0

Image

So I believe the right fix is to update the benchmark query

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions