[R] dplyr::glimpse method for arrow table and datasets #32109

@asfimport


When working with Arrow datasets/tables, I often find myself wanting to interactively print or "see" the results of a query or the first few rows of the data without having to fully collect into memory.

I can perform exploratory data analysis on large, out-of-memory datasets via Arrow + dplyr, but in order to print the returned values I have to either collect() them into memory or send them to DuckDB with to_duckdb().

  • compute() - returns number of rows/columns, but no data

  • collect() - returns data fully into memory, can be combined with head()

  • to_duckdb() - keeps data out of memory, always returns top 10 rows and all columns; the number of printed rows can optionally be increased or decreased

    While to_duckdb() gives me the ability to do true EDA, it seems counterintuitive to need to send the arrow table over to a duckdb database just to see the glimpse()/head() equivalent.
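
For reference, the collect()-based workaround from the list above might look like the following minimal sketch. It assumes the same mtcars Parquet data used elsewhere in this issue; the temporary file path, the row limit of 5, and the `preview` name are illustrative choices, not part of any existing API.

```r
library(dplyr)
library(arrow)

# Write mtcars to a temporary Parquet file (illustrative path) and open it lazily.
tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)
car_ds <- open_dataset(tmp)

# Build the query lazily, limit it with head(), then materialize only those rows.
preview <- car_ds %>%
  filter(cyl == 6) %>%
  head(5) %>%
  collect()

# `preview` is now an ordinary data frame, so glimpse() prints actual values.
glimpse(preview)
```

Only five rows are materialized in R here, but the explicit head() and collect() calls are still needed before glimpse() sees a data frame, which is the extra step this request aims to remove.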

    My feature request is for a dplyr::glimpse() method that lazily prints the first few values of a table/dataset. The expected output would be something like the following.

    library(dplyr)
    library(arrow)
    
    mtcars %>% arrow::write_parquet("mtcars.parquet")
    car_ds <- arrow::open_dataset("mtcars.parquet")
    
    car_ds %>% 
      glimpse()
    
    Rows: ??
    Columns: 11
    $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
    $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
    $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
    $ hp   <dbl> 110, 110, 93, 110, 175, 105, 2…
    $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
    $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
    $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
    $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
    $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
    $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
    $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …

    Currently, glimpse() prints a list-style dump of the object's internal R6 structure, most of which is unrelated to the actual data/values.

    library(dplyr)
    library(arrow)
    
    mtcars %>% arrow::write_parquet("mtcars.parquet")
    car_ds <- arrow::open_dataset("mtcars.parquet")
    
    car_ds %>% 
      glimpse()

#> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     .class_title: function () 
#>     clone: function (deep = FALSE) 
#>     files: active binding
#>     filesystem: active binding
#>     format: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     metadata: active binding
#>     NewScan: function () 
#>     num_cols: active binding
#>     num_rows: active binding
#>     pointer: function () 
#>     print: function (...) 
#>     schema: active binding
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: active binding

car_ds %>%
  filter(cyl == 6) %>%
  glimpse()
#> List of 7
#>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     .class_title: function () 
#>     clone: function (deep = FALSE) 
#>     files: active binding
#>     filesystem: active binding
#>     format: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     metadata: active binding
#>     NewScan: function () 
#>     num_cols: active binding
#>     num_rows: active binding
#>     pointer: function () 
#>     print: function (...) 
#>     schema: active binding
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: active binding 
#>  $ cyl :List of 11
#>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ disp:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ hp  :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ drat:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ wt  :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ qsec:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ vs  :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ am  :Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ gear:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>   ..$ carb:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>  $ disp:Classes 'Expression', 'ArrowObject', 'R6'
#>   Inherits from:
#>   Public:
#>     .:xp:.: externalptr
#>     cast: function (to_type, safe = TRUE, ...) 
#>     clone: function (deep = FALSE) 
#>     Equals: function (other, ...) 
#>     field_name: active binding
#>     initialize: function (xp) 
#>     invalidate: function () 
#>     pointer: function () 
#>     print: function (...) 
#>     schema: Schema, ArrowObject, R6
#>     set_pointer: function (xp) 
#>     ToString: function () 
#>     type: function (schema = self$schema) 
#>     type_id: function (schema = self$schema)  
#>  $ hp  : chr(0) 
#>  $ drat: NULL
#>  $ wt  : list()
#>  $ qsec: logi(0) 
#>  - attr(*, "class")= chr "arrow_dplyr_query"


<sup>Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

**Reporter**: [Thomas Mock](https://issues.apache.org/jira/browse/ARROW-16776)
**Assignee**: [Neal Richardson](https://issues.apache.org/jira/browse/ARROW-16776) / @nealrichardson
#### Related issues:
- [[R] printing data in Table/RecordBatch print method](https://github.com/apache/arrow/issues/32110) (Blocked)
#### PRs and other links:
- [GitHub Pull Request #13563](https://github.com/apache/arrow/pull/13563)

<sub>**Note**: *This issue was originally created as [ARROW-16776](https://issues.apache.org/jira/browse/ARROW-16776). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
