Add planning benchmarks with parquet and sortedness #13098

@alamb

Is your feature request related to a problem or challenge?

@mnorfolk03 added planning benchmarks for more sophisticated queries in #13085 ❤️

The benchmarks are in https://github.com/apache/datafusion/blob/main/datafusion/core/benches/sql_planner.rs

However, the planning benchmarks we have now don't reflect querying an actual data source such as parquet (they query an empty in-memory table).

It would help to add benchmarks that plan against a ParquetExec, as well as queries over tables with a declared sort order, to reflect more real-world cases.

Describe the solution you'd like

I would like planning benchmarks equivalent to planning against tables like these (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..';

CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULLS FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
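
As a rough illustration of the kind of table such a benchmark might plan against, the sketch below declares a Parquet-backed table with a sort order, followed by two queries whose plans should differ depending on whether that order can be used. The table name `sorted_hits`, its columns, and the location are hypothetical placeholders, not anything in the existing benchmarks.

```sql
-- Hypothetical table: the name, columns, and path are placeholders for illustration.
CREATE EXTERNAL TABLE sorted_hits (
    c2 INT NOT NULL,
    c5 INT NOT NULL,
    c8 INT NOT NULL
)
STORED AS PARQUET
WITH ORDER (c2 ASC)
LOCATION '/path/to/sorted_hits.parquet';

-- ORDER BY matches the declared order, so the planner may be able to avoid a sort.
EXPLAIN SELECT c2, c5 FROM sorted_hits ORDER BY c2 ASC LIMIT 10;

-- ORDER BY does not match the declared order, so a sort is expected in the plan.
EXPLAIN SELECT c2, c5 FROM sorted_hits ORDER BY c5 DESC LIMIT 10;
```

EXPLAIN plans the query without running it, which is roughly the work a planning benchmark would time.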

Describe alternatives you've considered

One possibility would be to add a benchmark for planning the ClickBench queries: https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench

We could use the smaller hits.parquet file here: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/data/clickbench_hits_10.parquet
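
A minimal sketch of that setup, assuming the session is started from the repository root so that the relative path resolves. The table name `hits` follows the ClickBench queries, and the query shown is only a simple stand-in for the files in the directory linked above.

```sql
-- Assumes the working directory is the repository root.
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'datafusion/core/tests/data/clickbench_hits_10.parquet';

-- Plan (but do not execute) a query against the Parquet-backed table.
EXPLAIN SELECT COUNT(*) FROM hits;
```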

Additional context

No response

Labels: enhancement (New feature or request)
