Description
We've often shown examples of querying GitHub Archive data with super. One approach we've shown, to compare and contrast with table-based legacy SQL approaches, is to run fuse on the original JSON, output the result as Parquet, and then work with the Parquet. Coinciding with the changes in #6633, attempts to output the fuse'd JSON as Parquet have started failing.
Details
The repro uses super commit cbb4109 (which is associated with the merge of the changes in #6633) and the subsequent commit 4dbc2fe. The two commits return different errors.
To download the JSON GitHub Archive data:
$ for num in $(seq 0 23); do file="2023-02-08-${num}.json.gz"; curl -L -O "https://data.gharchive.org/$file"; done
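Before running the repro, it may be worth confirming that all 24 hourly files downloaded intact, since a failed download would otherwise muddy the results. The following is a minimal sketch only: `check_archives` is a hypothetical helper name and not part of super or GH Archive. It assumes `gzip` is on the path and that the files use the same names as in the download loop above.

```shell
# Hypothetical helper (not part of super): check that the 24 hourly
# archives for a given day exist, are non-empty, and are valid gzip files.
# Usage: check_archives DAY [DIR]
check_archives() {
  day="$1"
  dir="${2:-.}"
  bad=0
  for num in $(seq 0 23); do
    file="$dir/$day-$num.json.gz"
    if [ ! -s "$file" ]; then
      # File is absent or zero bytes.
      echo "missing or empty: $file"
      bad=$((bad + 1))
    elif ! gzip -t "$file" 2>/dev/null; then
      # File exists but fails the gzip integrity test
      # (for example, an error page saved in place of the archive).
      echo "corrupt: $file"
      bad=$((bad + 1))
    fi
  done
  echo "$bad problem file(s)"
  # Return success only if every file passed.
  [ "$bad" -eq 0 ]
}
```

For the data used here, this would be invoked as `check_archives 2023-02-08` from the download directory.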
The following command worked in the past (for example, at commit 1bced44, right before the merge of #6633) but now fails at commit cbb4109:
$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-23-gcbb41094c
parquetio: not a record: error("missing field \"commits\" is not nullable")
Then at the very next commit, 4dbc2fe, there's a different error:
$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-24-g4dbc2fe29
parquetio: unsupported type: not implemented: support for DENSE_UNION
As a user, it's not clear to me whether the first error is still present but "hiding" behind the second.
In any case, the symptom persists through the current tip of main, which is commit 72522c2:
$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-37-g72522c227
parquetio: unsupported type: not implemented: support for DENSE_UNION