[R] unify_schemas=FALSE does not improve open_dataset() read times

open_dataset() provides the very helpful optional argument unify_schemas. Setting unify_schemas=FALSE should allow arrow to inspect a single parquet file instead of touching potentially thousands of parquet files to determine a unified schema. This ought to provide a substantial performance increase in contexts where the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance.  Consider the following reprexes:

 default, unify_schemas=TRUE 
```r
library(arrow)

# Public bucket, anonymous access
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org", anonymous = TRUE)

# Time dataset discovery with the default settings
bench::bench_time({
  open_dataset(ex)
})
```
about 32 seconds for me.

 manual, unify_schemas=FALSE:  
```r
# Same bucket as above, explicitly skipping schema unification
bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})
```
takes about 32 seconds as well. 
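
For anyone who wants to check this without network access, the same comparison can be sketched locally. This is only a rough stand-in for the S3 case: the temporary directory name, the file count of 50, and the column names are all made up for illustration, and local timings may not reflect what happens against remote storage.

```r
library(arrow)

# Hypothetical local stand-in for the bucket: many small parquet files
tmp <- file.path(tempdir(), "unify_schemas_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:50) {
  write_parquet(data.frame(id = i, value = rnorm(10)),
                file.path(tmp, sprintf("part-%03d.parquet", i)))
}

# Time dataset discovery with and without schema unification
t_unify  <- system.time(ds_unify  <- open_dataset(tmp, unify_schemas = TRUE))
t_single <- system.time(ds_single <- open_dataset(tmp, unify_schemas = FALSE))

# Elapsed seconds for each variant
print(c(unify = t_unify[["elapsed"]], single = t_single[["elapsed"]]))
```

Both calls should discover the same files and schema; only the elapsed times are of interest here.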

**Reporter**: [Carl Boettiger](https://issues.apache.org/jira/browse/ARROW-18114) / @cboettig

<sub>**Note**: *This issue was originally created as [ARROW-18114](https://issues.apache.org/jira/browse/ARROW-18114). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
