[R] unify_schemas=FALSE does not improve open_dataset() read times #33312
open_dataset() provides the very helpful optional argument to set unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands or more parquet files to determine a consistent unified schema. This ought to provide a substantial performance increase in contexts where the schema is known in advance.
Unfortunately, in my tests it seems to have no impact on performance. Consider the following reprexes:
default, unify_schemas=TRUE:
library(arrow)
ex <- s3_bucket(
  "neon4cast-scores/parquet/terrestrial_30min",
  endpoint_override = "data.ecoforecast.org",
  anonymous = TRUE
)
bench::bench_time({
  open_dataset(ex)
})
This takes about 32 seconds for me.
manual, unify_schemas=FALSE:
bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})
This takes about 32 seconds as well.
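For anyone who wants to compare the two settings without network access, here is a minimal local sketch. It is an illustration only, not part of the original report: the temporary directory, the file count of 20, and the column names are all hypothetical, and a small local dataset may not reproduce the S3 timings above.

```r
library(arrow)

# Hypothetical local dataset: 20 small parquet files sharing one schema.
tmp <- tempfile("unify_schemas_demo_")
dir.create(tmp)
for (i in 1:20) {
  write_parquet(
    data.frame(id = i, value = rnorm(5)),
    file.path(tmp, sprintf("part-%02d.parquet", i))
  )
}

# Time opening the dataset without and with unify_schemas = FALSE.
t_default <- system.time(ds_default <- open_dataset(tmp))
t_single <- system.time(ds_single <- open_dataset(tmp, unify_schemas = FALSE))

print(t_default[["elapsed"]])
print(t_single[["elapsed"]])

# Both calls should resolve to the same column names for identical files.
same_names <- identical(ds_default$schema$names, ds_single$schema$names)
print(same_names)
```

If the two elapsed times are close here as well, that would suggest the time is spent on something other than schema inspection, such as file discovery, but this sketch does not establish that.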
Reporter: Carl Boettiger / @cboettig
Note: This issue was originally created as ARROW-18114. Please see the migration documentation for further details.