Create ArrowReaderMetadata from externalized metadata #5582

@kylebarron

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In some multi-file Parquet dataset layouts, there is a sidecar metadata file, canonically named _metadata, which holds only the metadata for each row group in the dataset. See https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:

Some processing frameworks such as Spark or Dask (optionally) use _metadata and _common_metadata files with partitioned datasets.

Those files include information about the schema of the full dataset (for _common_metadata) and potentially all row group metadata of all files in the partitioned dataset as well (for _metadata). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.

Using those files can give a more efficient creation of a parquet Dataset, since it can use the stored schema and file paths of all row groups, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

I'd like to be able to use such metadata files to accelerate reading of Parquet datasets in geoarrow-rs. Mimicking pyarrow's API, I currently have a ParquetFile struct, which is backed by a single R: AsyncFileReader, as well as a ParquetDataset struct, which is backed by Vec<ParquetFile<R>>, where R: AsyncFileReader. This allows concurrent async reads across multiple files.

I'd like to have a ParquetDataset::from_metadata method, which constructs itself from a _metadata file. But to do that I need to be able to construct ArrowReaderMetadata for each underlying file. This is entirely possible with existing APIs, except that ArrowReaderMetadata::try_new has visibility pub(crate).
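To make the per-file step concrete, here is a minimal sketch of splitting the row groups of a combined `_metadata` file back out by data file. It uses only the standard library: `RowGroupMeta`, `FileMeta`, and `split_by_file` are hypothetical stand-ins for illustration, not types from the parquet crate, and the sketch assumes each row group entry records the path of the file it belongs to.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one row group's entry in a `_metadata` file.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    /// Path of the data file holding this row group (assumed to be recorded
    /// in the sidecar file, relative to the dataset root).
    file_path: String,
    /// Number of rows in this row group.
    num_rows: i64,
}

/// Hypothetical stand-in for the metadata of a single data file.
#[derive(Debug, Default, PartialEq)]
struct FileMeta {
    row_groups: Vec<RowGroupMeta>,
}

impl FileMeta {
    /// Total row count across this file's row groups.
    fn num_rows(&self) -> i64 {
        self.row_groups.iter().map(|rg| rg.num_rows).sum()
    }
}

/// Group the row groups of a combined metadata file by the data file they
/// belong to, preserving the order of row groups within each file.
fn split_by_file(row_groups: Vec<RowGroupMeta>) -> BTreeMap<String, FileMeta> {
    let mut files: BTreeMap<String, FileMeta> = BTreeMap::new();
    for rg in row_groups {
        files
            .entry(rg.file_path.clone())
            .or_default()
            .row_groups
            .push(rg);
    }
    files
}
```

In the real crate, each per-file group would then need to be turned into an `ArrowReaderMetadata`, which is the step currently blocked by the `pub(crate)` visibility.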

Describe the solution you'd like

Give ArrowReaderMetadata::try_new full public visibility.

Describe alternatives you've considered

Unsure of alternatives.

Labels: enhancement, parquet
