No longer able to output fused GHA data as Parquet #6650

@philrz

Description

We've often shown examples of querying GitHub Archive data with super. One of the approaches shown, to compare and contrast with legacy table-based SQL approaches, has been to run fuse on the original JSON, output the result as Parquet, and work with it as Parquet from there. Correlated with the changes in #6633, attempts to output the fused JSON as Parquet have started failing.

Details

The repro uses super commit cbb4109 (which is associated with the merge of the changes in #6633) and the subsequent commit 4dbc2fe. The two commits return different errors.

To download the JSON GitHub Archive data:

$ for num in $(seq 0 23); do file="2023-02-08-${num}.json.gz"; curl -L -O "https://data.gharchive.org/$file"; done
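
The download loop above can also be written as a re-runnable helper. This is only a sketch: the function names `gha_hour_files` and `gha_fetch` are made up for illustration and are not part of super or GitHub Archive tooling. It skips files that are already present and pass `gzip -t`, and it adds `curl -f` so an HTTP error is not saved as if it were an archive.

```shell
# Print the 24 hourly archive names for one date (YYYY-MM-DD).
gha_hour_files() {
  for num in $(seq 0 23); do
    echo "$1-${num}.json.gz"
  done
}

# Download every archive for that date that is missing or fails a gzip
# integrity check. Returns non-zero if any download failed.
gha_fetch() {
  status=0
  for file in $(gha_hour_files "$1"); do
    if [ -f "$file" ] && gzip -t "$file" 2>/dev/null; then
      continue   # already present and intact
    fi
    # -f makes curl fail on HTTP errors instead of saving an error page.
    curl -f -L -O "https://data.gharchive.org/$file" || status=1
  done
  return "$status"
}
```

Calling `gha_fetch 2023-02-08` fetches the same 24 files as the one-liner, and running it a second time should only retry files that are missing or corrupt.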

The following command worked in the past (for example at commit 1bced44, right before the merge of #6633) but now fails at commit cbb4109:

$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-23-gcbb41094c
parquetio: not a record: error("missing field \"commits\" is not nullable")

Then at the very next commit, 4dbc2fe, there's a different error:

$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-24-g4dbc2fe29
parquetio: unsupported type: not implemented: support for DENSE_UNION

As a user, it's not clear to me if the first error is still present but "hiding" behind the second.

In any case, the symptom persists through the tip of main, which is currently commit 72522c2.

$ super -version && super -f parquet -o gha-super.parquet -c 'fuse' 2023-02-08-*.json.gz
Version: v0.1.0-37-g72522c227
parquetio: unsupported type: not implemented: support for DENSE_UNION
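
To compare the runs above side by side, the same command can be wrapped in a small helper that is pointed at one super binary at a time. This is a sketch under assumptions: `run_repro` is a hypothetical name, and it presumes a separate super binary has already been built for each commit of interest. It uses only the flags shown in this report, and it only records what each build prints; it does not by itself reveal whether one error is masking another.

```shell
# Run the repro command from this issue against one super binary and
# print its version line, any error output, and the exit status.
# Usage: run_repro /path/to/super   (the path is hypothetical)
run_repro() {
  bin="$1"
  out="gha-super.parquet"
  "$bin" -version
  # Same flags as in the report; stderr is merged into stdout so the
  # error message is captured alongside the version line.
  "$bin" -f parquet -o "$out" -c 'fuse' 2023-02-08-*.json.gz 2>&1
  echo "exit status: $?"
}
```

For example, with binaries saved under illustrative names such as `./super-1bced44`, `./super-cbb4109`, and `./super-4dbc2fe`, a loop like `for b in ./super-*; do run_repro "$b"; done` would print one block of output per commit.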
