Consider resolving ClickBench files as Utf8 (rather than Binary) #12510

@alamb

Is your feature request related to a problem or challenge?

The ClickBench benchmark queries run against two datasets: a "single file" hits.parquet and a "partitioned" version that has 100 files in a directory. Both hold the same data.

However, DataFusion resolves hits.parquet such that columns like URL are Utf8 or Utf8View, while the same columns in hits_partitioned are resolved as Binary or BinaryView.

This has caused some small slowdowns while enabling StringView by default -- see #12509

You can see the schema resolution by:

cd benchmarks
# download hits.parquet
./bench.sh data clickbench_1
# download hits_partitioned
./bench.sh data clickbench_partitioned

Then run datafusion-cli:

cd data
# hits.parquet has Utf8 columns
datafusion-cli -c 'describe "hits.parquet"' | grep Utf8
| Title                 | Utf8      | NO          |
| URL                   | Utf8      | NO          |
| Referer               | Utf8      | NO          |
...
| UTMContent            | Utf8      | NO          |
| UTMTerm               | Utf8      | NO          |
| FromTag               | Utf8      | NO          |

# hits_partitioned has Binary type for the same columns
datafusion-cli -c 'describe "hits_partitioned"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
...
| UTMContent            | Binary    | YES         |
| UTMTerm               | Binary    | YES         |
| FromTag               | Binary    | YES         |

It seems that, for some reason, the individual files are all resolved to Binary:

datafusion-cli -c 'describe "hits_partitioned/hits_99.parquet"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
| FlashMinor2           | Binary    | YES         |
| UserAgentMinor        | Binary    | YES         |
...
datafusion-cli -c 'describe "hits_partitioned/hits_60.parquet"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
| FlashMinor2           | Binary    | YES         |
| UserAgentMinor        | Binary    | YES         |
...
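
To compare the two resolutions column by column rather than by eye, the `describe` output can be parsed and diffed. The following is a minimal Python sketch that assumes the three-cell row layout shown above (`| name | type | nullable |`); `parse_describe` and `diff_schemas` are hypothetical helper names, not part of DataFusion.

```python
def parse_describe(text):
    """Parse datafusion-cli `describe` rows into {column: (type, nullable)}."""
    schema = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only 3-cell data rows; skip borders, the header row, and "..." lines.
        if len(cells) != 3 or cells[0] in ("", "column_name"):
            continue
        name, data_type, nullable = cells
        schema[name] = (data_type, nullable)
    return schema


def diff_schemas(left, right):
    """Return (column, left_entry, right_entry) for shared columns that differ."""
    return [
        (name, left[name], right[name])
        for name in sorted(left.keys() & right.keys())
        if left[name] != right[name]
    ]


# Rows copied from the output above.
single_file = parse_describe("""
| Title                 | Utf8      | NO          |
| URL                   | Utf8      | NO          |
""")
partitioned = parse_describe("""
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
""")

for column, left, right in diff_schemas(single_file, partitioned):
    print(column, left, right)
```

Note that the rows above differ in nullability as well as in type, so a diff of this kind reports both.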

Describe the solution you'd like

Ideally, the ClickBench queries would resolve both datasets to the same schema, in this case Utf8, given the contents of the files and the queries that treat these columns as strings.
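
One way to picture the requested behavior is as a schema-level mapping from binary types to their string counterparts. The sketch below is a plain-Python illustration of that mapping only, not DataFusion code: `BINARY_TO_STRING`, `resolve_as_strings`, and the `some_int_column` entry are hypothetical names introduced for the example.

```python
# Hypothetical mapping from Arrow binary types to their string counterparts.
BINARY_TO_STRING = {
    "Binary": "Utf8",
    "LargeBinary": "LargeUtf8",
    "BinaryView": "Utf8View",
}


def resolve_as_strings(schema, string_columns=None):
    """Return a copy of `schema` ({column: type name}) with binary columns
    mapped to string types. If `string_columns` is given, only those columns
    are converted; otherwise every binary column is."""
    resolved = {}
    for name, data_type in schema.items():
        convert = string_columns is None or name in string_columns
        resolved[name] = BINARY_TO_STRING.get(data_type, data_type) if convert else data_type
    return resolved


# `some_int_column` is an illustrative placeholder for a non-string column.
partitioned = {"Title": "Binary", "URL": "Binary", "some_int_column": "Int64"}
print(resolve_as_strings(partitioned))
```

Such a mapping is only safe when the stored bytes are valid UTF-8, which the single-file schema suggests is the case for these columns.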

Describe alternatives you've considered

No response

Additional context

No response
