Closed as not planned
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
In the ClickBench benchmark queries, we use two datasets: a "single file" hits.parquet and a "partitioned" version, which has 100 files in a directory. They hold the same data.
However, DataFusion resolves hits.parquet such that columns like URL are Utf8 or Utf8View, while the same columns in hits_partitioned are resolved as Binary or BinaryView.
This has caused some small slowdowns while enabling StringView by default -- see #12509
You can see the schema resolution by:
cd benchmarks
# download hits.parquet
./bench.sh data clickbench_1
# download hits_partitioned
./bench.sh data clickbench_partitioned
Then run datafusion-cli:
cd data
# hits.parquet has Utf8 columns
datafusion-cli -c 'describe "hits.parquet"' | grep Utf8
| Title | Utf8 | NO |
| URL | Utf8 | NO |
| Referer | Utf8 | NO |
...
| UTMContent | Utf8 | NO |
| UTMTerm | Utf8 | NO |
| FromTag | Utf8 | NO |
# hits_partitioned has Binary type for the same columns
datafusion-cli -c 'describe "hits_partitioned"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
...
| UTMContent | Binary | YES |
| UTMTerm | Binary | YES |
| FromTag | Binary | YES |
It seems that, for some reason, the individual files are all resolved to Binary:
datafusion-cli -c 'describe "hits_partitioned/hits_99.parquet"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
| FlashMinor2 | Binary | YES |
| UserAgentMinor | Binary | YES |
...
datafusion-cli -c 'describe "hits_partitioned/hits_60.parquet"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
| FlashMinor2 | Binary | YES |
| UserAgentMinor | Binary | YES |
...
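The side-by-side comparison above can also be done mechanically. The following is a minimal sketch, not part of DataFusion or its benchmark scripts: `column_types` is a hypothetical helper that assumes `describe` rows are formatted as `| name | type | nullable |`, as in the output shown above. The sample rows are copied from that output so the sketch runs without downloading the data.

```shell
# Hypothetical helper: turn `describe` table rows into "column type" pairs.
# Assumes rows look like "| URL | Utf8 | NO |", as in the output above.
column_types() {
  awk -F'|' 'NF >= 4 && $2 !~ /column_name/ {
    gsub(/ /, "", $2); gsub(/ /, "", $3)
    if ($2 != "") print $2, $3
  }'
}

# Sample rows copied from the output above (not the full schema).
single='| Title | Utf8 | NO |
| URL | Utf8 | NO |'
partitioned='| Title | Binary | YES |
| URL | Binary | YES |'

# Write sorted "column type" listings to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' "$single" | column_types | sort > "$tmp/single.txt"
printf '%s\n' "$partitioned" | column_types | sort > "$tmp/partitioned.txt"

# Join on column name and report columns whose resolved types differ.
join "$tmp/single.txt" "$tmp/partitioned.txt" |
  awk '$2 != $3 { print $1 ": " $2 " vs " $3 }'

rm -rf "$tmp"
```

Against the real data, the two `printf` lines would be replaced by the `datafusion-cli -c 'describe ...'` commands shown above, piped into `column_types`.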
Describe the solution you'd like
Ideally, I would like the ClickBench queries to resolve to the same schema for both datasets, in this case Utf8, given the contents of the files and the queries that treat these columns as strings.
Describe alternatives you've considered
No response
Additional context
No response