[Python] Specifying schema does not prevent arrow from reading metadata on every single parquet? #34145
Description
Consider the following reprex, in which we open a partitioned parquet dataset on a remote S3 bucket:
```python
import pyarrow.dataset as ds
from pyarrow import fs
from timebudget import timebudget

@timebudget
def without_schema():
    s3 = fs.S3FileSystem(endpoint_override="data.ecoforecast.org", anonymous=True)
    df = ds.dataset(
        "neon4cast-forecasts/parquet/terrestrial_30min",
        format="parquet",
        filesystem=s3
    )
    return df

df = without_schema()
schema = df.schema
```

This takes a whopping 102 seconds on my machine. I believe most of the time is spent checking the metadata found in each parquet file, since there are many individual partitions in this data. That work should not be necessary, though, if we provide the schema:
```python
@timebudget
def with_schema():
    s3 = fs.S3FileSystem(endpoint_override="data.ecoforecast.org", anonymous=True)
    df2 = ds.dataset(
        "neon4cast-forecasts/parquet/terrestrial_30min",
        format="parquet",
        filesystem=s3,
        schema=schema
    )
    return df2

with_schema()
```

But observe the execution time is once again 102 seconds. Note that if we manually specify a single partition, the process takes only 2.7 seconds:
```python
@timebudget
def single():
    s3 = fs.S3FileSystem(endpoint_override="data.ecoforecast.org", anonymous=True)
    df = ds.dataset(
        "neon4cast-forecasts/parquet/terrestrial_30min/model_id=climatology/reference_datetime=2022-12-01 00:00:00/date=2022-12-02",
        format="parquet",
        filesystem=s3
    )
    return df

single()
```

I would have expected similar performance between these two cases: ds.dataset() should just be establishing a connection with the given schema, not reading any data. After all, part of the promise of partitioned data is that we can open the dataset at the root of the parquet store and rely on filtering operations to extract a specific subset without the code ever touching the other parquet files.
My guess here is that arrow is reading all the parquet file metadata to ensure it matches the schema, even though that is not the expected behavior. I think this is the same issue as seen in R in #33312. But maybe I'm not doing something correctly, or misunderstand the expected behavior?
Component(s)
Parquet, Python