[R] unify_schemas=FALSE does not improve open_dataset() read times #33312

@asfimport

open_dataset() provides the very helpful optional argument unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands of parquet files to determine a consistent unified schema. This ought to provide a substantial performance increase when the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance. Consider the following reprexes:

Default, unify_schemas=TRUE:

library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous = TRUE)
bench::bench_time({
  open_dataset(ex)
})

This takes about 32 seconds for me.

Manual, unify_schemas=FALSE:

bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})

This takes about 32 seconds as well.
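
The same comparison can be run without network access, which may help show whether the time is spent on schema inspection at all. The following is only a minimal sketch under stated assumptions: it assumes arrow is installed with dataset and parquet support, and the temporary directory, the toy data frame, and the variable names are hypothetical, not part of the S3 data above.

```r
library(arrow)

# Hypothetical toy dataset written to a temporary directory
# (an illustration only, not the S3 bucket used in the reprexes).
tmp <- tempfile("unify_demo_")
df <- data.frame(
  group = rep(1:20, each = 5),
  value = seq_len(100)
)

# Hive-style partitioning: one directory and parquet file per value of `group`.
write_dataset(df, tmp, format = "parquet", partitioning = "group")

# Time dataset discovery with and without schema unification.
t_unified <- system.time(ds_unified <- open_dataset(tmp, unify_schemas = TRUE))
t_first <- system.time(ds_first <- open_dataset(tmp, unify_schemas = FALSE))

print(t_unified[["elapsed"]])
print(t_first[["elapsed"]])

# When every file has the same schema, both paths should report the same fields.
print(ds_unified$schema$names)
print(ds_first$schema$names)
```

With only 20 small local files the two timings are likely too close to distinguish, so the number of partitions would need to be increased considerably for a meaningful comparison.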

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-18114. Please see the migration documentation for further details.
